
Micro-Partitions: Understanding Snowflake's File Structure


Date
Thursday, November 3, 2022

Niall Woodward

Co-founder & CTO of SELECT

Independently scalable compute and storage is an architectural fundamental of Snowflake. In this post, we’ll be focusing on how Snowflake stores data, and how its storage format can greatly accelerate query performance.

Snowflake Architecture Refresher


In a previous post, we discussed how Snowflake’s architecture is split into
three horizontal slices. Firstly there is a cloud services layer, which is a very
broad category for all Snowflake’s features outside of query execution.
Cloud services interact with the massively parallel query processing layer.
The virtual warehouses in the query processing layer read and write data
from the storage layer, which is S3 for the majority of Snowflake customers
(most Snowflake accounts run on AWS). Micro-partitions are found in this
storage layer.
Micro-partitions, partitioning, clustering - what
does it all mean?!
Confusingly, there are a number of similar but distinct terms associated
with micro-partitions. Here’s a glossary:

1. Micro-partitions (the focus of this post) - the unit of storage in Snowflake. A micro-partition, put simply, is a fancy kind of file. These are sometimes just referred to as partitions.
2. Clustering - describes the distribution of data across micro-partitions
for a particular table.
3. Clustered - All tables are clustered in the sense that the data is stored
in one or more micro-partitions, but the Snowflake
documentation defines this as "A table with a clustering key defined
is considered to be clustered."
4. Well-clustered - Well-clustered describes a table that prunes well for
the typical queries that are executed on it. Beware though that a table
that is well-clustered is not necessarily clustered per the above
definition.
5. Clustering key - a clustering key or expression can be defined for a
table that turns on Snowflake’s automatic clustering service. The
automatic clustering service is a billable, serverless process that
rearranges data in the micro-partitions to conform to the clustering
key specified.
6. Partitioning - partitioning doesn’t have a definition within the context
of Snowflake.
7. Warehouse cluster - the scaling unit in a multi-clustered warehouse.

What is a Snowflake micro-partition?


A micro-partition is a file, stored in the blob storage service for the cloud
service provider on which a Snowflake account runs:

 AWS - S3
 Azure - Azure Blob Storage
 GCP - Google Cloud Storage

Micro-partitions use a proprietary, closed-source file format created by Snowflake. They contain a header enclosing metadata describing the data stored, with the actual data grouped by column and stored in a compressed format.

A single micro-partition can contain up to 16MB of compressed data (this is where the 16MB limit on variant values comes from), which uncompressed is typically between 50 and 500MB. Small tables (<500MB uncompressed) may only have a single micro-partition, and as Snowflake has no limit on table size, there is by extension no limit on the number of micro-partitions a single table can have.

A micro-partition always contains complete rows of data. This is potentially confusing, as micro-partitions are also a columnar storage format. These two attributes are not contradictory though, as individual columns are retrievable from a micro-partition. You’ll sometimes hear micro-partitions described as ‘hybrid columnar’, due to the grouping of data by both rows (a micro-partition) and columns (within each micro-partition).

Ok, enough talking, time for a diagram. Throughout the post, we will use
the example of an orders table. Here's what one of the micro-partitions
stored for the table looks like:

In the header, the byte ranges for each column within the micro-partition
are stored, allowing Snowflake to retrieve only the columns relevant to a
query using a byte range get. This is why queries will run faster if you
reduce the number of columns selected.

Metadata about each column for the micro-partition is stored by Snowflake in its metadata cache in the cloud services layer. This metadata is used to provide extremely fast results for basic analytical queries such as count(*) and max(column). Additional metadata is also stored but not shown here, some of which is undocumented, described by Snowflake as 'used for both optimization and efficient query processing'. The min and max values crucially provide Snowflake with the ability to perform pruning.
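For example, on the hypothetical orders table used throughout this post, queries like the following can typically be answered from micro-partition metadata alone, without scanning any table data (assuming no filters are applied):

select count(*) from orders;
select max(created_at) from orders;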

What is Snowflake query pruning?


Pruning is a technique employed by Snowflake to reduce the number of
micro-partitions read when executing a query. Reading micro-partitions is
one of the costliest steps in a query, as it involves reading data remotely
over the network. If a filter is applied in a where clause, join, or subquery,
Snowflake will attempt to eliminate any micro-partitions it knows don’t
contain relevant data. For this to work, the micro-partitions have to contain
a narrow range of values for the column you're filtering on.

Let's zoom out to the entire orders table. In this example, the table contains
28 micro-partitions, each with three rows of data (in practice, a micro-
partition typically contains hundreds of thousands of rows).

This example table is sorted and therefore well-clustered on the created_at column (each micro-partition has a narrow range of values for that column). A user runs the following query on the table:
select *

from orders

where created_at > '2022/08/14'

Snowflake checks as part of the query planning process which micro-partitions contain data relevant to the query. In this case, only orders which were created after 2022/08/14 are needed. The query planner quickly identifies these records as present in only the first three micro-partitions highlighted in the diagram, using the min and max metadata for the created_at column. The rest of the micro-partitions are ignored (pruned), and Snowflake has reduced the amount of data it needs to read to only a small subset of the table.

Identifying query pruning performance


Snowflake's query profile displays lots of valuable information, including
pruning performance. It's not possible to determine the pruning statistics
for a query before it has been executed. Note that some queries do not
require a table scan, in which case no pruning statistics will be shown.

Classic web interface


In the History or Worksheets page, click a query ID. You'll be taken to the
Details page for that query. Next, click the Profile tab, and if displayed, the
last step number to view all the stats for the query. On the right-hand pane,
a Total Statistics section contains a Pruning subheading. Two values are
displayed, partitions scanned, and partitions total.
Snowsight web interface
After executing a query, a Query Details pane appears to the right of the
query results section. Click the three dots, and then View Query Profile. To
view the profile for a query executed previously, click the Activity button in
the left navigation bar, followed by Query History. Next, click the query of
interest, and then the Query Profile tab. Partitions scanned and partitions
total values are then displayed.

Interpreting the results


The partitions scanned value represents the number of partitions that have
been read by the query. The partitions total value represents the total
number of micro-partitions in existence for the tables selected from in the
query. A query that demonstrates good pruning has a small number of
partitions scanned in comparison to partitions total. The profiles shown
above demonstrate a query with no pruning whatsoever, as the number of
partitions scanned equals the total number of partitions. If we however add
a filter to the query that Snowflake can use its micro-partition metadata to
effectively prune on, we see significantly improved results.
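The same statistics are also exposed programmatically. Here's a minimal sketch (assuming access to the snowflake.account_usage share) that uses the partitions_scanned and partitions_total columns of the query_history view to surface recent queries with poor pruning:

select
    query_id,
    query_text,
    partitions_scanned,
    partitions_total,
    round(100 * partitions_scanned / nullif(partitions_total, 0), 1) as pct_partitions_scanned
from snowflake.account_usage.query_history
where start_time > dateadd('day', -7, current_timestamp())
    and partitions_total > 1000          -- only consider queries against larger tables
order by pct_partitions_scanned desc
limit 100;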

In summary
We've learned what micro-partitions are, how Snowflake uses them for
query optimization through pruning, and the various ways to measure
pruning performance.


3 Ways to Achieve Effective Clustering in Snowflake
Date
Saturday, November 12, 2022

Niall Woodward

Co-founder & CTO of SELECT

In our previous post on micro-partitions, we dove into how Snowflake's unique storage format enables a query optimization called pruning. Pairing query design with effective clustering can dramatically improve pruning and therefore query speeds. We'll explore how and when you should leverage this powerful Snowflake feature.

What is Snowflake clustering?


Clustering describes the distribution of data across micro-partitions, the
unit of storage in Snowflake, for a particular table. When a table is well-
clustered, Snowflake can leverage the metadata from each micro-partition
to minimize the number of files the query must scan, greatly improving
query performance. Due to this behaviour, clustering is one of the most
powerful optimization techniques Snowflake users can use to improve
performance and lower costs.

Let's explore this concept with an example.

Example of a well clustered table


In the diagram below, we have a hypothetical orders table that is well-clustered on the created_at column, as rows with similar created_at values are located in the same micro-partitions.
Snowflake maintains minimum and maximum value metadata for each
column in each micro-partition. In this table, each micro-partition contains
records for a narrow range of created_at values, so the table is well-
clustered on the column. The following query only scans the first three
micro-partitions highlighted, as Snowflake knows it can ignore the rest
based on the where clause and minimum and maximum value micro-
partition metadata. This behavior is called query pruning.

select *

from orders

where created_at > '2022/08/14'

Unsurprisingly, the impact of scanning only three micro-partitions instead of every micro-partition is that the query runs considerably faster.

When should you use clustering?


Most Snowflake users don’t need to consider clustering. If your queries run
fast enough and you’re comfortably under budget, then it’s really not worth
worrying about. But, if you care about performance and/or cost, you
definitely should care about clustering.

Pruning is arguably the most powerful optimization technique available to Snowflake users, as reducing the amount of data scanned and processed is such a fundamental principle in big data processing: “The fastest way to process data? Don’t.”

Snowflake’s documentation suggests that clustering is only beneficial for tables containing “multiple terabytes (TB) of data”. In our experience, however, clustering can have performance benefits for tables starting at hundreds of megabytes (MB).

Choosing a clustering key


To know whether a table is well-clustered for typical queries against it, you
first have to know what those query patterns are.
Snowflake's access_history view provides an easy way of retrieving historic
queries for a particular table.
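As a rough sketch of that approach (the table name is illustrative, and access to the snowflake.account_usage share is assumed), the base_objects_accessed array can be flattened to find queries that read a given table:

select
    qh.query_id,
    qh.query_text,
    qh.total_elapsed_time / 1000 as elapsed_seconds
from snowflake.account_usage.query_history as qh
where qh.query_id in (
    select ah.query_id
    from snowflake.account_usage.access_history as ah,
         lateral flatten(input => ah.base_objects_accessed) as obj
    where obj.value:"objectName"::string = 'ANALYTICS.PROD.ORDERS'   -- hypothetical table
      and ah.query_start_time > dateadd('day', -30, current_timestamp())
)
order by qh.total_elapsed_time desc
limit 100;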

Frequently used where clause filtering keys are good choices for clustering
keys. For example:

select *

from table_a

where created_at > '2022-09-25'

The above query will benefit from a table that is well-clustered on the created_at column, as similar values would be contained within the same micro-partition, resulting in only a small number of micro-partitions being scanned. This pruning determination is performed by the query compiler in the cloud services layer, prior to execution taking place.

In practice, we recommend starting by exploring the costliest queries in your account, which will likely highlight queries that prune micro-partitions ineffectively despite using filtering. These present opportunities to improve table clustering.

How to enable clustering in Snowflake?


Once you know what columns you want to cluster on, you'll need to choose a clustering method. We like to group the options into three categories.

1. Natural clustering
Suppose there is an ETL process adding new events to an events table each hour. A column inserted_at represents the time at which events are loaded into the table. Newly created micro-partitions will each have a tightly bound range of inserted_at values. This events table would be described as naturally clustered on the inserted_at column. A query that filters this table on the inserted_at column will prune micro-partitions effectively.

When performing a backfill of a table that you'd like to leverage natural, insertion-order clustering on, make sure to sort the data by the natural clustering key first. That way the historic records are well-clustered, as well as the new ones that get inserted.
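A minimal sketch of such a backfill (the source table name is illustrative):

insert into events
select *
from events_backfill_source
order by inserted_at;   -- sort by the natural clustering key so historic micro-partitions are well-clustered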

Pros

 No additional expenditure or effort required

Cons

 Only works for queries that filter on a column that correlates to the
order in which data is inserted

2. Automatic clustering service


The automatic clustering service and option 3, manual sorting, involve
sorting a table's data by a particular key. The sorting operation requires
computation, which can either be performed by Snowflake with the
automatic clustering service, or manually. The diagram below uses a date
column to illustrate, but a table can be re-clustered by any
expression/column.
The automatic clustering service uses Snowflake-managed compute
resources to perform the re-clustering operation. This service only runs if a
'clustering key' has been set for a table:

-- you can cluster by one or more comma separated columns

alter table my_table cluster by (column_to_cluster_by);

-- or you can cluster by an expression

alter table my_table cluster by (substring(column_to_cluster_by, 5, 15));

The automatic clustering service performs work in the background to create and destroy micro-partitions so they contain tightly bound ranges of records based on the specified clustering key. This service is charged based on how much work Snowflake performs, which depends on the clustering key, the size of the table and how frequently its contents are modified. Consequently, tables that are frequently modified (inserts, updates, deletes) will incur higher automatic clustering costs. It's worth noting that the automatic clustering service only uses the first 5 bytes of a column when performing re-clustering. This means that column values with the same first few characters won't cause the service to perform any re-clustering.

The automatic clustering service is simple to use, but easy to spend money
with. If you choose to use it, make sure to monitor both the cost and
impact on queries on the table to determine if it achieves a good
price/performance ratio. If you're interested in learning more about the
automatic clustering service, check out this detailed post on the inner
workings by one of Snowflake's engineers.
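One way to keep an eye on that spend is the automatic_clustering_history view, which reports credits consumed per table. A sketch, assuming access to the snowflake.account_usage share:

select
    database_name,
    schema_name,
    table_name,
    sum(credits_used) as credits_used_last_30d
from snowflake.account_usage.automatic_clustering_history
where start_time > dateadd('day', -30, current_timestamp())
group by 1, 2, 3
order by credits_used_last_30d desc;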

Pros

 The lowest effort way to cluster on a different key to the natural key.
 Doesn't block or interfere with DML operations.

Cons

 Unpredictable costs.
 Snowflake takes a higher margin on automatic clustering than
warehouse compute costs, which can make automatic clustering less
cost-effective than manual re-sorting.

3. Manual sorting
With fully recreated tables
If a table is always fully recreated as part of a transformation/modeling
process, the table can be perfectly clustered on any key by adding an order by clause to the create table as (CTAS) query:

create or replace table my_table as (
    with transformations as (
        ...
    )
    select *
    from transformations
    order by my_cluster_key
)
In this scenario of a table that is always fully recreated, we recommend
always using manual sorting over the automatic clustering service as the
table will be well-clustered, and at a much lower cost than the automatic
clustering service.

On existing tables
Manually re-sorting an existing table on a particular key simply replaces the
table with a sorted version of itself. Let’s suppose we have a sales table with
entries for lots of different stores, and most queries on the table always
filter for a specific store. We can perform the following query to ensure that
the table is well-clustered on the store_id:

create or replace table sales as (
    select * from sales order by store_id
)

As new sales are added to the table over time, the existing micro-partitions
will remain well-clustered by store_id, but new micro-partitions will
contain records for lots of different stores. That means that older micro-
partitions will prune well, but new micro-partitions won't. Once
performance decreases below acceptable levels, the manual re-sorting
query can be run again to ensure that all the micro-partitions are well-
clustered on store_id.

The benefit of manual re-sorting over the automatic clustering service is complete control over how frequently the table is re-clustered, and the associated spend. However, the danger of this approach is that any DML operations which occur on the table while the create or replace table operation is running will be undone. Manual re-sorting should only be used on tables with predictable or pausable DML patterns, where you can be sure that no DML operations will run while the re-sort is taking place.
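If the DML patterns do allow it, one way to automate the periodic re-sort is a scheduled Snowflake task. This is only a sketch: the warehouse name and schedule are illustrative, and it assumes no other process writes to the table while the re-sort runs.

create or replace task resort_sales_weekly
    warehouse = transform_wh                  -- hypothetical warehouse
    schedule = 'USING CRON 0 3 * * 0 UTC'     -- 03:00 UTC every Sunday
as
    create or replace table sales as
    select * from sales order by store_id;

alter task resort_sales_weekly resume;        -- tasks are created suspended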

Pros

 Provides complete control over the clustering process.


 Lowest cost way to achieve perfect clustering on any key.
Cons

 Higher effort than the automatic clustering service. Requires the user
to either manually execute the sorting query or implement
automated orchestration of the sorting query.
 Replacing an existing table with a sorted version of itself reverses any
DML operations which run during the re-sort.

Which clustering strategy should you use and when?
Always aim to leverage natural clustering, as by definition it requires no re-clustering of a table. Transformation processes that use incremental data processing to only process new/updated data should always add an inserted_at or updated_at column for this reason, as these will be naturally clustered and produce efficient pruning.

It’s common to see that most queries for an organization filter by the same
columns, such as region or store_id. If queries with common filtering
patterns are causing full table scans, then depending on how the table is
populated, consider using automatic clustering or manual re-sorting to
cluster on the filtered column. If you’re not sure how you’d implement
manual re-sorting or there's a risk of DML operations running during the
re-sort, use the automatic clustering service.

Other good candidates for re-clustering are tables queried on a timestamp column which doesn't always correlate to when the data was inserted, so natural clustering can't be used. An example of this is an events table which is frequently queried on event_created_at or similar, but events can arrive late and so micro-partitions have time range overlap. Re-clustering the table on the event_created_at column will ensure the queries prune well.

Regardless of the clustering approach chosen, it’s always a good idea to sort data by the desired clustering key before inserting into the table.

Closing
Ultimately, pruning is achieved with complementary query design and table
clustering. The more data, the more powerful pruning is, with the potential
to improve a query's performance by orders of magnitude.

We’ll go deeper on the topic of clustering in future posts, including the use
of Snowflake’s system$clustering_information function to analyze
clustering statistics. We'll also explore options for when a table needs to be
well-clustered on more than one column, so be sure to subscribe to our
mailing list below. Thanks for reading, and please get in touch
via Twitter or email where we'd be happy to answer questions or discuss
these topics in more detail.
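As a quick preview of that function, it takes a table name and a candidate clustering key expression, and returns a JSON summary of how well the table's micro-partitions overlap for that key:

select system$clustering_information('orders', '(created_at)');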
Choosing the right warehouse size
in Snowflake
Date
Sunday, November 27, 2022

Niall Woodward

Co-founder & CTO of SELECT

Snowflake users enjoy a lot of flexibility when it comes to compute configuration. In this post we cover the implications of virtual warehouse sizing on query speeds, and share some techniques to determine the right one.

The days of complex and slow cluster resizing are behind us; Snowflake
makes it possible to spin up a new virtual warehouse or resize an existing
one in a matter of seconds. The implications of this are:
1. Significantly reduced compute idling (auto-suspend and scale-in for
multi-cluster warehouses)
2. Better matching of compute power to workloads (ease of
provisioning, de-provisioning and modifying warehouses)

Being able to easily allocate workloads to different warehouse configurations means faster query run times, improved spend efficiency, and a better user experience for data teams and their stakeholders. That leads to the question:

Which warehouse size should I use?

Before we look to answer that question, let's first understand what a virtual
warehouse is, and the impact of size on its available resources and query
processing speed.

What is a virtual warehouse in Snowflake?


Snowflake constructs warehouses from compute nodes. The X-Small uses a
single compute node, a small warehouse uses two nodes, a medium uses
four nodes, and so on. Each node has 8 cores/threads, regardless of cloud
provider. The specifications aren’t published by Snowflake, but it’s fairly well
known that on AWS each node (except for 5XL and 6XL warehouses) is a
c5d.2xlarge EC2 instance, with 16GB of RAM and a 200GB SSD. The
specifications for different cloud providers vary and have been chosen to
provide performance parity across clouds.

While the nodes in each warehouse are physically separated, they operate
in harmony, and Snowflake can utilize all the nodes for a single query.
Consequently, we can work on the basis that each warehouse size increase
doubles the available compute cores, RAM, and disk space available.
What virtual warehouse sizes are available?
Snowflake uses t-shirt sizing names for their warehouses, but unlike t-shirts,
each step up indicates a doubling of resources and credit consumption.
Sizes range from X-Small to 6X-Large. Most Snowflake users will only ever
use the smallest warehouse, the X-Small, as it’s powerful enough for most
datasets up to tens of gigabytes, depending on the complexity of the
workloads.

Warehouse Size   Credits per hour
X-Small          1
Small            2
Medium           4
Large            8
X-Large          16
2X-Large         32
3X-Large         64
4X-Large         128
5X-Large         256
6X-Large         512
The impact of warehouse size on Snowflake query
speeds
1. Processing power
Snowflake uses parallel processing to execute a query across multiple cores
wherever it is faster to do so. More cores means more processing power,
which is why queries often run faster on larger warehouses.

There’s an overhead to distributing a query across multiple cores and then combining the result set at the end though, which means that for a certain size of data, it can be slower to run a query across more cores. When that happens, Snowflake won’t distribute a query across any more cores, and increasing a warehouse size won’t yield speed improvements.

Unfortunately, Snowflake doesn’t provide data on compute core utilization. The best available proxy is the number of micro-partitions scanned by a query. Each micro-partition can be retrieved by an individual core. If a
query scans a smaller number of micro-partitions than there are cores in
the warehouse, the warehouse will be under-utilized for the table scan step.
Snowflake solutions architects often recommend choosing a warehouse
size such that for each core there are roughly four micro-partitions
scanned. The number of scanned micro-partitions can be seen in the query
profile, or query_history view and table functions.

2. RAM and local storage


Data processing requires a place to store intermediate data sets. Beyond
the CPU caches, RAM is the fastest place to store and retrieve data. Once
it’s used up, Snowflake will start using SSD local storage to persist data
between query execution steps. This behavior is called ‘spillage to local
storage’ - local storage is still fast to access, but not as fast as RAM. If
Snowflake runs out of local storage, then remote storage will be used.
Remote storage means the object store for the cloud provider, so S3 for
AWS. Remote storage is considerably slower to access than local storage,
but it is infinite, which means Snowflake will never abort a query due to out
of memory errors. Spillage to remote storage is the clearest indicator that a
warehouse is undersized, and increasing the warehouse size may improve
the query’s speed by more than double. Both local and remote spillage
volumes can again be seen in the query profile, or query_history view and
table functions.
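As a sketch of what that looks like outside the query profile (assuming access to the snowflake.account_usage share), the query_history view exposes both spillage volumes and partitions scanned:

select
    query_id,
    warehouse_name,
    warehouse_size,
    partitions_scanned,
    bytes_spilled_to_local_storage,
    bytes_spilled_to_remote_storage
from snowflake.account_usage.query_history
where start_time > dateadd('day', -7, current_timestamp())
    and bytes_spilled_to_remote_storage > 0
order by bytes_spilled_to_remote_storage desc
limit 100;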

Cost vs performance
CPU-bound queries will double in speed as the warehouse size increases,
up until the point at which they no longer fully utilize the warehouse’s
resources. Ignoring warehouse idle times from auto-suspend thresholds, a
query which runs twice as fast on a medium as on a small warehouse will cost the same amount to run, as cost = duration x credit usage rate. The
below graph illustrates this behavior, showing that at a certain point, the
execution time for bigger warehouses remains the same while the cost
increases. So, how do we find that sweet spot of maximum performance for
the lowest cost?

Determining the best warehouse size for a Snowflake query
Here’s the process we recommend:

1. Always start with an X-Small.


2. Increase the warehouse size until the query duration stops halving.
When this happens, the warehouse is no longer fully utilized.
3. For the best cost-to-performance ratio, choose a warehouse size one
smaller. For example, if going from a Medium to a Large warehouse
only decreases the query time by 25%, then use the medium
warehouse. If faster performance is needed, use a larger warehouse,
but beware that returns are diminishing at this point.

A warehouse can run more than one query at a time, so where possible keep warehouses fully loaded, even with light queueing, for maximum efficiency. Warehouses for non-user queries such as transformation pipelines can often be run at greater efficiency due to the tolerance for queueing.
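Resizing a warehouse between test runs in step 2 above is a single statement. A small sketch (the warehouse name is illustrative; disabling the result cache keeps timings comparable between runs):

alter session set use_cached_result = false;   -- avoid a cached result skewing the comparison
alter warehouse analytics_wh set warehouse_size = 'SMALL';
-- re-run the query and note its duration, then try the next size up
alter warehouse analytics_wh set warehouse_size = 'MEDIUM';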

Heuristics to identify incorrectly sized warehouses


Experimentation is the only way to exactly determine the best warehouse
size, but here are a few indicators we’ve learned from experience:

Is there remote disk spillage?


A query with significant remote disk spillage typically will at least double in
speed when the warehouse size increases. Remote disk spillage is very
time-consuming, and removing it by providing more RAM and local storage
for the query will give a big speed boost while saving money if the query
runs in less than half the time.

Is there local disk spillage?


Local disk spillage is nowhere near as bad as remote disk spillage, but it still
slows queries down. Increasing warehouse size will speed up the query, but
it’s less likely to double its speed unless the query is also CPU bound. It’s
worth a try though!

Does the query run in less than 10 seconds?


The query is likely not fully utilizing the warehouse’s resources, which
means it can be run on a smaller warehouse more cost-effectively.
Example views for identifying oversized
warehouses
Here are two helpful views we leverage in the SELECT product to help customers with warehouse sizing (a sketch of the kind of query behind the first view is shown after the list):

1. The number of queries by execution time. Here you can see that over
98% of the queries running on this warehouse are taking less than 1
second to execute.
2. The number of queries by utilizable warehouse size. Utilizable
warehouse size represents the size of warehouse a query can fully
utilize. Where lots of queries don't utilize the warehouse's size, it
indicates that the warehouse is oversized or the queries should run
on a smaller warehouse. In this example, over 96% of queries being
run on the warehouse aren’t using all 8 nodes available in the Large
warehouse.
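A simplified sketch of the kind of query behind the first view (the warehouse name is illustrative, and total_elapsed_time is reported in milliseconds):

select
    case
        when total_elapsed_time < 1000 then 'a: < 1s'
        when total_elapsed_time < 10000 then 'b: 1-10s'
        when total_elapsed_time < 60000 then 'c: 10-60s'
        else 'd: > 60s'
    end as execution_time_bucket,
    count(*) as num_queries
from snowflake.account_usage.query_history
where warehouse_name = 'LARGE_WH'              -- hypothetical warehouse
    and start_time > dateadd('day', -30, current_timestamp())
group by 1
order by 1;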
Using partitions scanned as a heuristic
Another helpful heuristic is to look at how many micro-partitions a query is scanning, and then choose the warehouse size based on that. This strategy comes from Scott Redding, a resident solutions architect at Snowflake.

The intuition behind this strategy is that the number of threads available for
processing doubles with each warehouse size increase, and each thread can
process a single micro-partition at a time. You want to ensure that each
thread has plenty of work available (files to process) throughout the query
execution.

To interpret this chart, the goal is to aim for around 250 micro-partitions per thread. If your query needs to scan 2000 micro-partitions, then running the query on an X-Small will give each thread 250 micro-partitions (files) to process, which is ideal. Compare this with running the query on a 3XL warehouse, which has 512 threads. Each of those threads will only get around 4 micro-partitions to process, which will likely result in many threads sitting unused.

The main pitfall with this approach is that while micro-partitions scanned is
a significant factor in the query execution, other factors like query
complexity, exploding joins, and volume of data sorted will also impact the
required processing power.
Closing
Snowflake makes it easy to match workloads to warehouse configurations,
and we’ve seen queries more than double in speed while costing less
money by choosing the correct warehouse size. Increasing warehouse size
isn't the only option available to make a query run faster though, and many
queries can be made to run more efficiently by identifying and resolving
their bottlenecks. We'll provide a detailed guide on query optimization in a
future post, but if you haven't yet, check out our previous post on
clustering.

If you’re interested in being notified when we release future posts, subscribe to our mailing list below.
Snowflake Query Optimization: 16 tips to
make your queries run faster
Date
Monday, February 12, 2024

Ian Whitestone

Co-founder & CEO of SELECT

Niall Woodward

Co-founder & CTO of SELECT

Snowflake's huge popularity is driven by its ability to process large volumes of data at
extremely low latency with minimal configuration. As a result, it is an established favorite of
data teams across thousands of organizations. In this guide, we share optimization techniques
to maximize the performance and efficiency of Snowflake. Follow these best practices to
make queries run faster while also reducing costs.

All the Snowflake performance tuning techniques discussed in this post are based on the real
world strategies SELECT has helped over 100 Snowflake customers employ. If you think
there's something we've missed, we'd love to hear from you! Reach out via email or use the
chat bubble at the bottom of the screen.

Interested in cost optimization?


This post covers query optimization techniques, and how you can leverage them to make
your Snowflake queries run faster. While this can help lower costs, there are better places to
start if that is your primary goal. Be sure to check out our post on Snowflake cost
optimization for actionable strategies to lower your costs.

Snowflake Query Optimization Techniques


The Snowflake query performance optimization techniques in this post broadly fall into three
separate categories:

1. Improve data read efficiency

Queries can sometimes spend considerable time reading data from table storage. This step of
the query is shown as a TableScan in the query profile. A TableScan involves downloading
data over the network from the table's storage location into the virtual warehouse's worker
nodes. This process can be sped up by reducing the volume of data downloaded, or increasing
the virtual warehouse's size.

Snowflake only reads the columns which are selected in a query, and the micro-partitions
relevant to a query's filters - provided that the table's micro-partitions are well-
clustered on the filter condition.

The four techniques to reduce the data downloaded by a query and therefore speed up
TableScans are:

 Reduce the number of columns accessed


 Leverage query pruning & table clustering
 Use clustered columns in join predicates
 Use pre-aggregated tables
2. Improve data processing efficiency

Operations like Joins, Sorts, Aggregates occur downstream of TableScans and can often
become the bottleneck in queries. Strategies to optimize data processing include reducing the
number of query steps, incrementally processing data, and using your knowledge of the data
to improve performance.
Techniques to improve data processing efficiency include:

 Simplify and reduce the number of query operations


 Reduce the volume of data being processed by filtering early
 Avoid repeated references to CTEs
 Remove un-necessary sorts
 Prefer window functions over self-joins
 Avoid joins with an OR condition
 Use your knowledge of the data to help Snowflake process it efficiently
 Avoid querying complex views
 Ensure effective use of the query caches
3. Optimize warehouse configuration

Snowflake's virtual warehouses can be easily configured to support larger and higher
concurrency workloads. Key configurations which improve performance are:

 Increase the warehouse size


 Increase the warehouse cluster count
 Change the warehouse scaling policy

Before diving into optimizations, let's first remind ourselves how to identify what's slowing a
query down.

How to optimize a Snowflake Query


Before you can optimize a Snowflake query, it's important to understand what the actual
bottleneck in the query is by leveraging query profiling. Which operations are slowing it
down, and where should you consequently focus your efforts?

To figure this out, use the Snowflake Query Profile (query plan), and look to the 'Most
Expensive Nodes' section. This tells you which parts of the query are taking up the most
query execution time.
In this example, we can see the bottleneck is the Sort step, which would indicate that we
should focus on improving the data processing efficiency, and possibly increase the
warehouse size. If a query's most expensive node(s) are TableScans, efforts will be best
spent optimizing the data read efficiency of the query.

1. Select fewer columns


This is a simple one, but where it's possible, it makes a big difference. Query requirements
can change over time, and columns that were once useful may no longer be needed or used by
downstream processes. Snowflake stores data in a hybrid-columnar file format called micro-
partitions. This format enables Snowflake to reduce the amount of data that has to be read
from storage. The process of downloading micro-partition data is called scanning, and
reducing the number of columns results in less data transfer over the network.

2. Leverage query pruning


To reduce the number of micro-partitions scanned by a query, a technique known as query
pruning, a few things need to happen:

1. Your query needs to include a filter which limits the data required by the query. This
can be an explicit where filter or an implicit join filter.
2. Your table needs to be well clustered on the column used for filtering.
Running the below query against the hypothetical orders table shown in the diagram will
result in query pruning since (a) the orders table is clustered by created_at (the data is
sorted by created_at) and (b) the where clause explicitly filters the created_at with a
specific date.

select *

from orders

where created_at > '2022/08/14'

To determine whether pruning performance can be improved, take a look at the query
profile's Partitions scanned and Partitions total statistics.

If you're not using where clause filtering in a query, adding one may speed up the
TableScan significantly (and downstream nodes too as they process less data). If your query
already has where clause filtering but the 'Partitions scanned' are close to the 'Partitions total',
this means that the where clause is not being effectively pruned on.

Improve pruning by:

1. Ensuring where clauses are placed as early in queries as possible, otherwise they may
not be 'pushed down' onto the TableScan step (this also speeds up later steps in the
query)
2. Adding well-clustered columns to join and merge conditions, which can be pushed down as JoinFilters to enable pruning
3. Making sure that columns used in where filters of a query align with the table's clustering (learn more about clustering here)
4. Avoiding the use of functions in where conditions - these often prevent Snowflake from pruning micro-partitions (see the sketch below)
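As an illustration of point 4 above using the orders example (exact behavior varies by function, so always confirm the partitions scanned in the query profile):

-- Wrapping the filtered column in a function can block pruning:
select *
from orders
where to_varchar(created_at, 'YYYY-MM-DD') = '2022-08-14';

-- Filtering on the raw column keeps the filter prunable:
select *
from orders
where created_at >= '2022-08-14'
  and created_at < '2022-08-15';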
3. Use clustered columns in join predicates
The most common form of pruning most users will be familiar with is static query pruning.
Here’s a simple example, similar to the one above:

select *

from orders

where order_date > current_date - 7

If the orders table is clustered by order_date, Snowflake’s query optimizer will recognize that most micro-partitions (files) containing data older than 7 days ago can be ignored. Since scanning remote data requires significant processing time, eliminating micro-partitions will greatly increase the query speed.

A lesser known feature of Snowflake’s query engine is dynamic pruning. Compared to static
pruning which happens before execution during the query planning phase, dynamic query
pruning happens on the fly as the query is being executed.

Consider a process that regularly updates existing records in the orders table through
a MERGE command. Under the hood, a MERGE requires a join between the source table
containing the new/updated records and the target table (orders) that we want to update.

Dynamic pruning kicks in during the join. How does it work? As the Snowflake query engine
reads the data from the source table, it can identify the range of records present and
automatically push down a filter operation to the target table to avoid un-necessary data
scanning.

Let’s ground this in an example. Imagine we have a source table containing 3 records we
need to update in the target orders table, which is clustered by order date. A typical MERGE operation would match records between the two tables using a unique key, such as
order key. Because these unique keys are usually random, they won’t force any query
pruning. If we instead modify the MERGE condition to match records on both order key and
order date, then dynamic query pruning can kick in. As Snowflake reads data from the source
table, it can detect the range of dates covered by the 3 orders we are updating. It can then
push down that range of dates into a filter on the target side to prevent having to scan that
entire large table.
How can you apply this to your day to day? If you currently have
any MERGE or JOIN operations where significant time is spent scanning the target table (on the
right), then consider whether you can introduce additional predicates to your join clause that
will force query pruning. Note, this will only work if (a) your target table is clustered by
some key (b) the source table (on the left) you are joining to contains a tightly bound range of
records on the cluster key (i.e. a subset of order dates).
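Here's a sketch of what that could look like for the orders example (table and column names are illustrative):

merge into orders as target
using updated_orders as source
    on target.order_key = source.order_key
    and target.order_date = source.order_date   -- extra predicate on the clustering key enables dynamic pruning
when matched then update set
    target.status = source.status
when not matched then insert (order_key, order_date, status)
    values (source.order_key, source.order_date, source.status);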

Using dbt?
When using an incremental materialization strategy in dbt, a MERGE query will be executed
under the hood. To add in an additional join condition to force dynamic pruning, update
the unique_key array to include the extra column (i.e. updated_at).

{{ config(

materialized='incremental',

unique_key=['order_id', 'updated_at'],

) }}

select *

from {{ ref('stg_orders') }}

...

4. Use pre-aggregated tables


Create 'rollup' or 'derived' tables that contain fewer rows of data. Pre-aggregated tables can
often be designed to provide the information most queries require while using less storage
space. This makes them much faster to query. For retail businesses, a common strategy is to
use a daily orders rollup table for financial and stock reporting, with the raw orders table only
queried where per-order granularity is needed.
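A sketch of such a rollup for the orders example (the measure column is hypothetical), rebuilt or incrementally refreshed by the transformation pipeline:

create or replace table orders_daily as
select
    created_at::date as order_date,
    count(*) as num_orders,
    sum(order_total) as total_revenue   -- hypothetical measure column
from orders
group by 1;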

5. Simplify!
Each operation in a query takes time to move data around between worker threads.
Consolidating and removing unnecessary operations reduces the amount of network transfer
required to execute a query. It also helps Snowflake reuse computations and save additional
work. Most of the time, CTEs and subqueries do not impact performance, so use them to help
with readability.

In general, having each query do less makes them easier to debug. Additionally, it reduces the
chance of the Snowflake query optimizer making the wrong decision (i.e. picking the wrong
join order).

6. Reduce the volume of data processed


The less data there is, the faster each data processing step completes. Reducing both the
number of columns and rows processed by each step in a query will improve performance.
Here's an example where moving a qualify filter earlier in the query resulted in a roughly 3X improvement in query runtime. The first query profile shows the runtime when the QUALIFY filter happened after a join.

Because the QUALIFY filter didn't require information after the join, it could be moved earlier
in the query. This results in significantly less data being joined, vastly improving
performance:
For transformation queries that write to another table, a powerful way of reducing the volume
of data processed is incrementalization. For the example of the orders table, we could
configure the query to only process new or updated orders, and merge those results into the
existing table.
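As a rough sketch of the QUALIFY example (the customers table and join are illustrative), the filter runs against the smaller input before the join rather than against the joined result:

with latest_orders as (
    select *
    from orders
    qualify row_number() over (partition by customer_id order by created_at desc) = 1
)
select
    c.customer_id,
    o.created_at as latest_order_at
from customers as c
left join latest_orders as o
    on o.customer_id = c.customer_id;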

7. Repeating CTEs can be faster, sometimes


We've previously written about whether you should use CTEs in Snowflake. Whenever
you reference a CTE more than once in your query, you will see a WithClause operation in
the query profile (see example below). In certain scenarios, this can actually make the query
slower and it can be more efficient to re-write the CTE each time you need to reference it.
When a CTE reaches a certain level of complexity, it’s going to be cheaper to calculate the
CTE once and then pass its results along to downstream references rather than re-compute it
multiple times. This behavior isn’t consistent though, so it’s best to experiment. Here’s a way
to visualize the relationship:
8. Remove un-necessary sorts
Sorting is an expensive operation, so be sure to remove any sorts that are not required:

9. Prefer window functions over self-joins


Rather than using a self join, try to use window functions wherever possible, as self joins are very expensive since they result in a join explosion:
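For example, to fetch each order's previous order date per customer, a single-pass window function avoids the self join entirely (column names are illustrative):

-- Self-join version (prone to join explosion):
-- select o1.order_id, max(o2.created_at) as previous_order_at
-- from orders as o1
-- left join orders as o2
--     on o2.customer_id = o1.customer_id
--     and o2.created_at < o1.created_at
-- group by o1.order_id;

-- Window function equivalent:
select
    order_id,
    lag(created_at) over (partition by customer_id order by created_at) as previous_order_at
from orders;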
10. Avoid joins with an OR condition
Similar to self-joins, joins with an OR join condition result in a join explosion since they are
executed as a cartesian join with a post filter operation. Use two left joins instead:
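A sketch of the rewrite (table and column names are illustrative; the two-join form is equivalent when each condition matches at most one row, so validate it against your data):

-- Join with an OR condition (cartesian join + post-filter):
-- select a.*, b.some_value
-- from table_a as a
-- left join table_b as b
--     on a.key_1 = b.key_1 or a.key_2 = b.key_2;

-- Rewritten as two left joins:
select
    a.*,
    coalesce(b1.some_value, b2.some_value) as some_value
from table_a as a
left join table_b as b1
    on a.key_1 = b1.key_1
left join table_b as b2
    on a.key_2 = b2.key_2;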
11. Use your knowledge of the data to help Snowflake
process it efficiently
Your own knowledge of the data can be used to improve query performance. For example, if
a query groups on many columns and you know that some of the columns are redundant as
the others already represent the same or a higher granularity, it may be faster to remove those
columns from the group by and join them back in in a separate step.

If a grouped or joined column is heavily skewed (meaning a small number of distinct values
occur most frequently), this can have a detrimental impact on Snowflake's speed. A common
example is grouping by a column that contains a significant number of null values. Filtering
rows with these values out and processing them in a separate operation can result in faster
query speeds.

Finally, range joins can be slow in all data warehouses including Snowflake. Your knowledge
of the interval lengths in the data can be used to reduce the join explosion that occurs. Check
out our recent post on this if you're seeing slow range join performance.

12. Avoid complex views


As a best practice, avoid creating and using any complex views in your queries. Views should
be used to persist simple data transformations like renaming columns, basic column
calculations, or for data models with lightweight joins.
To understand how complex views can wreak havoc, consider the following, seemingly
innocent query:

select

a.*,

b.*

from model_a as a

left join model_b as b

on a.id=b.id

This query was repeatedly taking >45 minutes to run, and failing due to an "Incident".

When digging into the query profile (also referred to as the "query plan"), you can see that
the models being queried were actually complex views, with hundreds of tables.
The solution here is to split up the complex view into simpler, smaller parts, and persist them
as tables.

13. Ensure effective use of the query caches


Each node in a virtual warehouse contains local disk storage which can be used for caching
micro-partitions read from remote storage. If multiple queries access the same set of data in a
table, the queries can scan the data from the local disk cache instead of remote storage, which
can speed up a query if the primary bottleneck is reading data.

When the warehouse suspends, Snowflake doesn't guarantee that the cache will persist when
the warehouse is resumed. The impact of cache loss is that queries have to re-scan data from
table storage, rather than reading it from the much faster local cache. If warehouse cache loss
is impacting queries, increasing the auto-suspend threshold will help.
Separately, Snowflake has a global result cache which will return results for identical queries
executed within 24 hours provided that data in the queried tables is the same. There are
certain situations which can prevent the global result cache from being leveraged (i.e. if your
query has a non-deterministic function), so be sure to check that you are hitting the global
result cache when expected. If not, you may need to tweak your query or reach out to support
to file a bug.

14. Increase the warehouse size


Warehouse size determines the total computational power available to queries running on the
warehouse, also known as vertical scaling.

Increase virtual warehouse size when:

1. Queries are spilling to remote disk (identifiable via the query profile)
2. Query results are needed faster (typically for user-facing applications)

Queries that spill to remote disk run inefficiently due to the large volumes of network traffic between the warehouse executing the query and the remote disk that stores data used in executing the query. Increasing the warehouse size doubles both the available RAM and local
disk, which are significantly faster to access than remote disk. Where remote disk spillage
occurs, increasing the warehouse size can more than double a query's speed. We've gone into
more detail on Snowflake warehouse sizing in the past, and covered how to configure
warehouse sizes in dbt too.

Note, if most of the queries running on the warehouse don't require a larger warehouse, and
you want to avoid increasing the warehouse size for all queries, you can instead consider
using Snowflake's Query Acceleration Service. This service, available on Enterprise edition
and above, can be used to give queries which scan a lot of data additional compute resources.
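A sketch of enabling the query acceleration service on an existing warehouse (the warehouse name and scale factor are illustrative):

alter warehouse reporting_wh set
    enable_query_acceleration = true
    query_acceleration_max_scale_factor = 8;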
15. Increase the Max Cluster Count
Multi-cluster warehouses, available on Enterprise edition and above, can be used to create
more instances of the same size warehouse.

If there are periods where warehouse queuing causes queries to not meet their required processing speeds, consider using multi-clustering or increasing the maximum cluster count in a warehouse. This allows the warehouse to match query volumes by adding or removing clusters as demand changes.

Unlike the cluster count, Snowflake cannot automatically adjust the size of a virtual warehouse in response to query volumes. This makes multi-cluster warehouses more cost-effective for processing volatile query volumes, as each cluster is only billable while in an active state.
16. Adjust the Cluster Scaling Policy


Snowflake offers two scaling policies - Standard and Economy. For all virtual warehouses
which serve user-facing queries, use the Standard scaling policy. If you're very cost-
conscious, experiment with the Economy scaling policy for queueing-tolerant workloads such
as data loading to see if it reduces cost while maintaining the required throughput.
Otherwise, we recommend using Standard for all warehouses.

Other Resources
If you are looking for more content on Snowflake query optimization, we recommend
exploring the additional video resources below.

Behind the Cape: 3 Part Series on Snowflake Cost Optimization


(2023)
In this 3 part video series, Ian joined Snowflake data superhero Keith Belanger for Behind
the Cape, a series of videos where Snowflake experts dive into various topics.
Part 1
For this episode, we tackled the meaty topic of Snowflake cost optimization. Given we only
had 30 minutes, it ended up being a higher level conversation around how to get started,
Snowflake's billing model, and tools Snowflake provides to control costs.

Here is a full list of topics we discussed:

1. How should you get started with Snowflake cost optimization? (TL;DR: build up a holistic understanding of your cost drivers before diving into any optimization efforts)
2. Where most customers are today with their understanding of Snowflake usage
3. How does Snowflake's billing model work (did you know it's actually cheaper to store
data in Snowflake?)
4. The tools offered by Snowflake for cost visibility
5. Methods you have to control costs (resource monitors, query timeouts, and
ACCESS CONTROL - the one no one thinks of!)
6. Where should you start with cost cutting? Start optimizing queries? Or go higher
level?
7. Resources for learning more.

For those looking to get an overview of cost optimization, monitoring and control, this is a great place to start. The video recording can be found below. There is so much to discuss on this topic and we didn't get to go super deep, so we'll have to do a follow-up soon!

Part 2
In this episode we go deeper into some important foundational concepts around Snowflake
query optimization:

1. The lifecycle of a Snowflake query


2. Snowflake virtual warehouse sizing
3. How to use the Snowflake query profile and identify bottlenecks

Part 3
In the final episode of the series, we dive into the most important query optimization
techniques:

1. Understanding Snowflake micro-partitions


2. How to leverage query pruning
3. How to ensure your tables are effectively clustered

Snowflake Optimization Power Hour Video (2022)


On September 28, 2022, Ian gave a presentation to the Snowflake Toronto User Group on
Snowflake performance tuning and cost optimization. The following content was covered:
1. Snowflake architecture
2. The lifecycle of a Snowflake query
3. Snowflake's billing model
4. A simple framework for cost optimization, along with a detailed methodology for
how to calculate cost per query
5. Warehouse configuration best practices
6. Table clustering tips

Slides
The slides can be viewed here. To navigate the slides, you can click the arrows on the bottom
right, or use the arrow keys on your keyboard. Press either the "esc" or the "o" key to zoom
out into an "overview" mode where you can see all slides. From there, you can again navigate
using the arrows and either click a slide or press "esc"/"o" to focus on it.

Presentation Recording
A recording of the presentation is available on youtube. The presentation starts at 3:29.

If you would like, I am more than happy to come in and give this presentation (or a variation
of it) to your team where they can have the opportunity to ask questions. Send an email
to ian@select.dev if you would like to set that up.

Query Optimization at Snowflake (2020)


If you'd like to better understand the Snowflake query optimizer internals, I'd highly recommend watching this talk from Jiaqi Yan, one of Snowflake's most senior database engineers:
Ian Whitestone
Co-founder & CEO of SELECT
Ian is the Co-founder & CEO of SELECT, a software product which helps users
automatically optimize, understand and monitor Snowflake usage. Prior to starting SELECT,
Ian spent 6 years leading full stack data science & engineering teams at Shopify and Capital
One. At Shopify, Ian led the efforts to optimize their data warehouse and increase cost
observability.

Niall Woodward
Co-founder & CTO of SELECT
Niall is the Co-Founder & CTO of SELECT, a software product which helps users
automatically optimize, understand and monitor Snowflake usage. Prior to starting SELECT,
Niall was a data engineer at Brooklyn Data Company and several startups. As an open-source
enthusiast, he's also a maintainer of SQLFluff, and creator of three dbt packages:
dbt_artifacts, dbt_snowflake_monitoring and dbt_snowflake_query_tags.

Snowflake Cost Optimization: 15 proven strategies for reducing costs
Date
Monday, February 12, 2024

Ian Whitestone

Co-founder & CEO of SELECT


Niall Woodward

Co-founder & CTO of SELECT

Snowflake is an incredibly powerful platform, easily scaling to handle ever-larger data volumes without compromising on performance. But, if not controlled, costs associated with this scaling quickly climb. Whether your goal is to reduce the price of an upcoming renewal, extend your existing contract's runway, or reduce on-demand costs, use the strategies in this post to make significant cost savings.

Everything we discuss is based on the real world strategies SELECT has helped over 100
Snowflake customers employ. If you think there's something we've missed, we'd love to hear
from you! Reach out via email or use the chat bubble at the bottom of the screen.

Looking to speed up your queries?


This post covers cost optimization techniques, and how you can leverage them to eliminate
unnecessary Snowflake credit consumption and free up budget for other workloads. If your
goal is making your Snowflake queries run faster, be sure to check out our post
on Snowflake query optimization for actionable tips to speed up query execution time.

Before you start


Before you start, it's incredibly important that you first understand your actual Snowflake
cost drivers and the Snowflake pricing model. We see many Snowflake customers jump
straight into optimizing specific queries or trying to reduce their storage costs, without
realizing that those places may not be the problem or best place to start.

As a starting point, we recommend looking at the Snowflake cost management overview in the admin section of the UI and building up an understanding of which services (compute, storage, serverless, etc.) make up the bulk of your costs. For most customers, compute will be the biggest driver (typically over 80% of overall Snowflake costs). Once you've figured this out, your next focus should be understanding which workloads within your virtual warehouses actually make up the bulk of your costs. You can determine this by calculating a cost per query and then aggregating those costs based on query metadata (i.e. query tags or comments).

Cost Optimization Techniques


The cost reduction techniques in this post fall into five broad categories:
1. Virtual warehouse configuration

 Reducing auto-suspend
 Reducing the warehouse size
 Ensure minimum clusters are set to 1
 Consolidate warehouses

2. Workload configuration

 Reducing query frequency


 Only process new or updated data

3. Table configuration

 Ensure your tables are clustered correctly


 Drop unused tables
 Lower data retention
 Use transient tables

4. Data loading patterns

 Avoid frequent DML operations


 Ensure your files are optimally sized before loading

5. Leverage built-in Snowflake controls

 Leverage access control to restrict warehouse usage & modifications


 Enable query timeouts
 Configure Snowflake resource monitors

Let's get straight into it.

1. Reduce auto-suspend to 60 seconds


Use 60 second auto-suspend timeouts for all virtual warehouses. The only place where this
recommendation can differ is for user-facing workloads where low latency is paramount, and
the warehouse cache is frequently being used. If you’re not sure, go for 60 seconds, and
increase it if performance suffers.

alter warehouse compute_wh set auto_suspend=60;


Auto-suspend settings have a big impact on the bill because Snowflake charges for every
second a Snowflake warehouse is running, with a minimum of 60 seconds. For this reason,
we don't recommend setting auto-suspend below 60 seconds, as it can lead to double charges.
Auto-suspend settings over 60 seconds result in virtual warehouses being billable while
processing no queries. By default, all virtual warehouses created via the user interface have
auto-suspend periods of 5 minutes, so be careful when creating new warehouses, too.

Auto-suspend below 60s can result in double billing


Each time a Snowflake virtual warehouse resumes, you are charged for a minimum of 1
minute. After that period, you are charged per second. While it is technically possible to set
the auto suspend value lower than 60s (you can put 0s if you want!), Snowflake will only shut
it down after 30 seconds of inactivity.

Because of the minimum 1 minute billing period, it's possible for users to get double charged
if the auto-suspend is set to 30s. Here's an example:

1. A query comes in and runs for 1s


2. The warehouse shuts down after 30s
3. Another query comes in right after it shuts down and wakes up the warehouse, runs
for 1s
4. The warehouse shuts down again after 30s

Despite only being up for ~1 minute, the user will actually be charged for 2 minutes of
compute in this scenario since the warehouse has resumed twice and you are charged for a
minimum of 1 minute each time it resumes.

2. Reduce virtual warehouse size


Virtual warehouse computation resources and cost scale exponentially. Here’s a quick
reminder, with compute costs displayed per hour as credits (dollars) assuming a typical rate
of $2.5 per credit.

Hourly virtual warehouse pricing

Warehouse   X-Small     Small       Medium       Large        X-Large
Cost        1 ($2.50)   2 ($5)      4 ($10)      8 ($20)      16 ($40)

Warehouse   2X-Large    3X-Large    4X-Large     5X-Large     6X-Large
Cost        32 ($80)    64 ($160)   128 ($320)   256 ($640)   512 ($1,280)

Here’s the monthly pricing for each warehouse assuming running continuously (though not
typically how warehouses run due to auto-suspend, it gives a better sense of cost than
hourly):

Monthly virtual warehouse pricing

Warehouse   X-Small            Small               Medium              Large                X-Large
Cost        720 ($1,800)       1,440 ($3,600)      2,880 ($7,200)      5,760 ($14,400)      11,520 ($28,800)

Warehouse   2X-Large           3X-Large            4X-Large            5X-Large             6X-Large
Cost        23,040 ($57,600)   46,080 ($115,200)   92,160 ($230,400)   184,320 ($460,800)   368,640 ($921,600)

Over-sized warehouses can sometimes make up the majority of Snowflake usage. Reduce
warehouse sizes and observe the impact on workloads. If performance is still acceptable, try
reducing size again. Check out our full guide to choosing the right warehouse size in
Snowflake, which includes practical heuristics you can use to identify oversized warehouses.

Example of reducing the warehouse size:

As a quick practical example, consider a data loading job that loads ten files every hour on a
Small size warehouse. A small size warehouse has 2 nodes and a total of 16 cores available
for processing. This job can at most saturate 10 out of the 16 cores (1 file per core), meaning
this warehouse will not be fully utilized. It would be significantly more cost effective to run
this job on an X-Small warehouse.
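
As a minimal sketch (using a hypothetical name for the loading warehouse in the example above),
resizing is a single statement, and it is just as easy to revert if performance suffers:

-- step the hypothetical loading warehouse down one size
alter warehouse loading_wh set warehouse_size = 'XSMALL';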

3. Ensure minimum clusters are set to 1


Snowflake Enterprise editions or higher provide multi-cluster warehousing, allowing
warehouses to add additional clusters in parallel to handle increased demand. The minimum
cluster count setting should always be set to 1 to avoid over-provisioning. Snowflake will
automatically add clusters up to the maximum cluster count with minimal provisioning time,
as needed. Minimum cluster counts higher than 1 lead to unused and billable clusters.

alter warehouse compute_wh set min_cluster_count=1;

4. Consolidate warehouses
A big problem we see with many Snowflake customers is warehouse sprawl. When there
are too many warehouses, many of them will not be fully saturated with queries and they will
sit idle, resulting in unnecessary credit consumption.

Here's an example of the warehouses in our own Snowflake account, visualized in the
SELECT product. We calculate & surface a custom metric called warehouse utilization
efficiency, which looks at the % of time the warehouse is active and processing queries.
Looking at the SELECT_BACKEND_LARGE warehouse in the second row, this warehouse has a
low utilization efficiency of 11%, meaning that 89% of the time we are paying for it, it is
sitting idle and not processing any queries. There are several other warehouses with low
efficiency as well.

The best way to ensure virtual warehouses are being utilized efficiently is to use as few as
possible. Where needed, create separate warehouses based on performance requirements
versus domains of workload.

For example, creating one warehouse for all data loading, one for transformations, and one
for live BI querying will lead to better cost efficiency than one warehouse for marketing data
and one for finance data. All data-loading workloads typically have the same performance
requirements (tolerate some queueing) and can often share a multi-cluster X-Small
warehouse. In contrast, all live, user-facing queries may benefit from a larger warehouse to
reduce latency.

Where workloads within each category (loading, transformation, live querying, etc.) need a
larger warehouse size for acceptable query speeds, create a new larger warehouse just for
those. For best cost efficiency, queries should always run on the smallest warehouse they
perform sufficiently quickly on.
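
If you don't have tooling that surfaces a utilization metric, a simple starting point is to list each
warehouse's recent credit consumption from the account usage views and review whether the long
tail of small, sporadically used warehouses can be merged. A sketch:

-- compute credits per warehouse over the last 30 days;
-- low-spend, rarely used warehouses are candidates for consolidation
select
    warehouse_name,
    sum(credits_used_compute) as compute_credits_last_30d
from snowflake.account_usage.warehouse_metering_history
where start_time >= dateadd('day', -30, current_timestamp)
group by 1
order by 2 desc;
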
5. Reduce query frequency
At many organizations, batch data transformation jobs often run hourly by default. But do
downstream use cases need such low latency? Here are some examples of how reducing run
frequency can have an immediate impact on cost. In this example, we assume all workloads
are non-incremental and so perform a full data refresh each run, and that the initial cost of
running hourly was $100,000.

Run frequency                                           Annual cost
Hourly                                                  $100,000
Hourly on weekdays, daily at the weekend                $75,000
Hourly during working hours                             $50,000
Once at the start of the working day, once around midday  $8,000
Daily                                                   $4,000

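How you change the run frequency depends on your orchestrator, but as an illustration, if the job
runs as a Snowflake task (the task name below is hypothetical), the schedule can be relaxed with a
few statements:

-- hypothetical task: move from hourly to hourly-during-working-hours on weekdays
alter task transform_orders suspend;
alter task transform_orders set schedule = 'USING CRON 0 8-18 * * 1-5 UTC';
alter task transform_orders resume;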
6. Only process new or updated data


Significant volumes of data are often immutable, meaning they don’t change once created.
Examples of these include web and shipment events. Some do change, but rarely beyond a
certain time interval, such as orders which are unlikely to be altered beyond a month after
purchase.

Rather than reprocessing all data in every batch data transformation job, incrementalization
can be used to filter for only records which are new or updated within a certain time window,
perform the transformations, and then insert or update that data into the final table.

For a year-old events table, the impact of switching to incremental, insert-only
transformations for new records only could reduce costs by 99%. For a year-old orders table,
re-processing and updating only the last month’s and new orders could reduce costs by 90%
compared with full refreshes.
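
The exact implementation depends on your transformation tooling (dbt's incremental
materializations handle this pattern for you), but a hand-rolled sketch against hypothetical raw
and transformed event tables might look like this:

-- only transform events that arrived since the last load, then append them
insert into analytics.events_transformed
select
    event_id,
    user_id,
    event_type,
    created_at
from raw.events
where created_at > (select max(created_at) from analytics.events_transformed);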

Here's an example of the cost reduction we achieved by converting one of our data models to
only process new data:
7. Ensure tables are clustered correctly
One of the most important query optimization techniques is query pruning: a technique to
reduce the number of micro-partitions scanned when executing a query. Reading micro-
partitions is one of the most expensive steps in a query, since it involves reading data
remotely over the network. If a filter is applied in a where clause, join, or subquery,
Snowflake will attempt to eliminate any micro-partitions it knows don’t contain relevant data.
For this to work, the micro-partitions have to contain a narrow range of values for the column
you're filtering on.

In order for query pruning to be possible, the table in Snowflake needs to be
clustered correctly based on the query access patterns. Consider an orders table, where
users frequently filter for orders created after (created_at) a certain date. A table like this
should be clustered by created_at.
When a user runs the query below against the orders table, query pruning can eliminate most
of the micro-partitions from being scanned, which greatly reduces query runtime and
therefore results in lower costs.

select *

from orders

where created_at > '2022/08/14'

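If the table's natural insertion order doesn't already line up with created_at, one option (covered
in more depth in the clustering post later in this document) is to define a clustering key so that
Snowflake's automatic clustering service maintains that layout, bearing in mind that the service is
billable:

-- enable automatic clustering on the column most queries filter by
alter table orders cluster by (created_at);
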
8. Drop unused tables


While typically only a small (<20%) component of the overall Snowflake costs, unused tables
and time-travel backups can eat away at your Snowflake credits. We've previously written
about how users can identify unused tables if they are on Snowflake Enterprise edition or
higher. If not, use the TABLE_STORAGE_METRICS view to rank order
by TOTAL_BILLABLE_BYTES to find the tables with the highest storage costs.

select

table_catalog as database_name,

table_schema as schema_name,

table_name,

(active_bytes + time_travel_bytes + failsafe_bytes + retained_for_clone_bytes) as total_billable_bytes

from snowflake.account_usage.table_storage_metrics

order by total_billable_bytes desc

limit 10
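
Once you've confirmed a table is safe to remove, dropping it is a one-liner. Note that a dropped
table can still be restored with UNDROP while it remains within its time travel retention period,
and only stops incurring storage costs once time travel and fail-safe have elapsed. The table name
below is hypothetical:

-- remove an unused table
drop table analytics.reporting.stale_orders_snapshot;

-- restore it if you change your mind within the retention period
undrop table analytics.reporting.stale_orders_snapshot;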

9. Lower data retention


As discussed in our post on Snowflake storage costs, the time travel (data retention) setting
can result in added costs since it must maintain copies of all modifications and changes to a
table made over the retention period. All Snowflake users should question whether they need
to be able to access historical versions of their tables and, if so, how many days of history
they need to retain.

To lower the data retention period for a specific table, you can run the query below:

alter table my_table set data_retention_time_in_days=0;

Or to make the change account wide, run:

alter account set data_retention_time_in_days=0;

10. Use transient tables


Fail safe storage is another source of storage costs that can add up, particularly for tables
with lots of churn (more on this below).

If your tables are regularly deleted and re-created as part of an ETL process, or if you have
a separate copy of the data available in cloud storage, then in most cases there is no need to
back up their data. By changing a table from permanent to transient, you can avoid
spending unnecessarily on Fail-safe and Time Travel backups.

-- example query to create a table as transient
create or replace transient table orders as (
    select *
    from raw.orders
    ...
);

11. Avoid frequent DML operations


A well known anti-pattern in Snowflake is treating it like an operational database where you
frequently update, insert, or delete a small number of records.

Why should this be avoided? For two reasons:

1. Snowflake tables are stored in immutable micro-partitions, which Snowflake aims
to keep around 16MB compressed. A single micro-partition can therefore contain
hundreds of thousands of records. Each time you update or delete a single record,
Snowflake must re-create the entire micro-partition, so modifying one record can
effectively mean rewriting hundreds of thousands of them. Inserts can also be
affected by this due to a process known as small file compaction, where Snowflake
tries to combine new records with existing micro-partitions rather than creating a
new micro-partition containing only a few records (see the sketch after this list).
2. For the time travel & fail safe storage features, Snowflake must keep copies of all
versions of a table. If micro-partitions are frequently being re-created, the amount of
storage will increase significantly. For tables with high churn (frequent updates), the
time travel and fail safe storage can often become greater than the active storage of
the table itself. You can read about the life cycle of a Snowflake table for more
details on this topic.
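
As a sketch of the difference, the hypothetical single-row pattern below forces Snowflake to
rewrite a micro-partition for every statement, whereas accumulating changes in a staging table and
applying them in one batched MERGE rewrites each affected micro-partition only once:

-- anti-pattern: many single-row updates, each rewriting an entire micro-partition
update orders set status = 'shipped' where order_id = 1001;
update orders set status = 'shipped' where order_id = 1002;
-- ...repeated thousands of times

-- better: stage the changes and apply them in a single batched statement
merge into orders as target
using staged_order_updates as source
    on target.order_id = source.order_id
when matched then update set target.status = source.status;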

12. Ensure files are optimally sized


To ensure cost effective data loading, a best practice is to keep your files around 100-250MB.
To demonstrate these effects, consider the image below. If we only have one 1GB file, we
will only saturate 1 of the 16 threads on a Small warehouse used for loading.

If you instead split this file into ten files that are 100 MB each, you will utilize 10 threads out
of 16. This level of parallelization is much better, as it leads to better utilisation of the given
compute resources (although it's worth noting that an X-Small would still be the better choice
in this scenario).

Having too many small files can also lead to excessive costs if you are using Snowpipe for data
loading, since Snowflake charges an overhead fee of 0.06 credits per 1000 files loaded.
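
For example, at that rate, loading one million small files would incur 1,000,000 / 1,000 x 0.06 = 60
credits of per-file overhead alone, before any of the compute used for loading is counted.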

13. Leverage access control


Access control is a powerful technique for controlling costs that many Snowflake customers
don't think of. By restricting who can make changes to virtual warehouses, you will minimize
the chances of someone accidentally making an unintended resource modification that leads
to unexpected costs. We've seen many scenarios where someone increases a virtual
warehouse size and then forgets to change it back. By implementing stricter access control,
companies can ensure that resource modifications go through a controlled process and
minimize the chance of unintended changes being made.

You can also use access control to limit which users can run queries on certain warehouses.
By only allowing users to use smaller warehouses, you will force them to write more efficient
queries rather than defaulting to running on a larger warehouse size. Where required, policies
or processes can be put in place to allow certain queries or users to run on a larger warehouse
when absolutely necessary.

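The exact role design will vary, but a minimal sketch of the idea, using hypothetical role and
warehouse names, is to grant analysts usage on a small warehouse while keeping the MODIFY
privilege (which allows resizing) restricted to an administrative role:

-- analysts can run queries on the small warehouse...
grant usage on warehouse analytics_xs_wh to role analyst;

-- ...but only the admin role can resize or reconfigure it
grant modify on warehouse analytics_xs_wh to role warehouse_admin;
revoke modify on warehouse analytics_xs_wh from role analyst;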
14. Enable query timeouts


Query timeouts are a setting that prevents Snowflake queries from running for too long, and
consequently from costing too much. If a query runs for longer than the timeout setting, the
query is automatically cancelled by Snowflake.

Query timeouts are a great way to mitigate the impact of runaway queries. By default, a
Snowflake query can run for two days before it is cancelled, racking up significant costs. We
recommend you put query timeouts in place on all warehouses to mitigate the maximum cost
a single query can incur. See our post on the topic for more advice on how to set these.
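
Timeouts are controlled by the STATEMENT_TIMEOUT_IN_SECONDS parameter, which defaults to
172,800 seconds (two days) and can be set at the account, warehouse, user, or session level. A
sketch with illustrative values:

-- cancel any query on this warehouse that runs longer than one hour
alter warehouse compute_wh set statement_timeout_in_seconds = 3600;

-- or set a more conservative account-wide default
alter account set statement_timeout_in_seconds = 7200;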

15. Configure resource monitors


Similar to query timeouts, resource monitors allow you to restrict the total cost a given
warehouse can incur. You can use resource monitors for two purposes:

1. To send you a notification once costs reach a certain threshold


2. To restrict a warehouse from costing more than a certain amount in a given time
period. Snowflake can prevent queries from running on a warehouse once it has
surpassed its quota.

Resource monitors are a great way to avoid surprises in your bill and prevent unnecessary
costs from occurring in the first place.

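A minimal sketch covering both use cases, with hypothetical quota values:

-- notify at 80% of the monthly credit quota and suspend the warehouse at 100%
create resource monitor transforming_monitor
with
    credit_quota = 500
    frequency = monthly
    start_timestamp = immediately
    triggers
        on 80 percent do notify
        on 100 percent do suspend;

alter warehouse transforming_wh set resource_monitor = transforming_monitor;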
A final word of advice


There's a great quote from the world of FinOps which is worth sharing:

The greatest cloud savings emerge from costs never borne

In this post, we've shared a bunch of ways you can lower your Snowflake costs to meet your
cost reduction targets, or free up your budget for new workloads. But, getting to a place
where you need these techniques means you have already incurred these costs and
potentially operated in a suboptimal way for an extended period.

One of the best ways to prevent unnecessary costs is to implement an effective cost
monitoring strategy from the start. Build your own dashboard on top of the Snowflake
account usage views and review it each week, or try out a purpose built cost monitoring
product like SELECT. By catching spend issues early, you can prevent unnecessary costs
from piling up in the first place.

The Missing Manual: Everything You Need to Know about Snowflake Cost Optimization (April 2023)

If you're looking for a presentation which covers many of the topics discussed in the post, we
recommend watching the talk we gave at Data Council in April 2023.
In this talk, we cover everything you need to know about cost and performance optimization
in Snowflake. We start with a deep dive into Snowflake’s architecture & billing model,
covering key concepts like virtual warehouses, micro-partitioning, the lifecycle of a query
and Snowflake’s two-tiered cache. We go in depth on the most important optimization
strategies, like virtual warehouse configuration, table clustering and query writing best
practices. Throughout the talk, we share code snippets and other resources you can leverage
to get the most out of Snowflake.

Recording

A recording of the presentation is available on YouTube.

If you would like, we are more than happy to come in and give this presentation (or a
variation of it) to your team where they can have the opportunity to ask questions. Send an
email to ian@select.dev if you would like to set that up.

Slides

Ian Whitestone
Co-founder & CEO of SELECT
Ian is the Co-founder & CEO of SELECT, a software product which helps users
automatically optimize, understand and monitor Snowflake usage. Prior to starting SELECT,
Ian spent 6 years leading full stack data science & engineering teams at Shopify and Capital
One. At Shopify, Ian led the efforts to optimize their data warehouse and increase cost
observability.

Niall Woodward
Co-founder & CTO of SELECT
Niall is the Co-Founder & CTO of SELECT, a software product which helps users
automatically optimize, understand and monitor Snowflake usage. Prior to starting SELECT,
Niall was a data engineer at Brooklyn Data Company and several startups. As an open-source
enthusiast, he's also a maintainer of SQLFluff, and creator of three dbt packages:
dbt_artifacts, dbt_snowflake_monitoring and dbt_snowflake_query_tags.


Snowflake Architecture Explained: 3
Crucial Layers
Date
Monday, September 12, 2022

Ian Whitestone

Co-founder & CEO of SELECT

Snowflake has skyrocketed in popularity over the past 5 years and firmly
planted itself at the center of many companies' data stacks. Snowflake
came into existence in 2012 with a unique architecture, described in
their seminal white paper as "the elastic data warehouse". Rather than
have compute and storage coupled on the same machine like their
competitors did[1], they proposed a new design that took advantage of the

near-infinite resources available in cloud computing platforms like Amazon


Web Services (AWS). In this post, we'll dive into the three layers of
Snowflake's data warehouse architecture[2]: cloud services, compute and storage.

Cloud Services
The cloud services layer is the entry point for all interactions a user will have
with Snowflake. It consists of stateless services, backed by
a FoundationDB database storing all required metadata. Authentication
and access control (who can access Snowflake and what can they do within
it) are examples of services in this layer. Query compilation and
optimization are other critical roles handled by cloud services. Snowflake
performs performance optimizations like reducing the number of micro-
partitions that a given user's query needs to scan (compile-time pruning).
Cloud services are also responsible for infrastructure and transaction
management. When new virtual warehouses need to be provisioned to
serve a query, cloud services will ensure they become available. If a query is
attempting to access data that is being updated by another transaction, the
cloud services layer waits for the update to complete before results are
returned.

From a performance standpoint, one of the most important roles of cloud


services is to cache query results in its global result cache, which can be
returned extremely quickly if the same query is run again[3]. This can greatly
reduce the load on the compute layer[4], which we'll discuss next.

Compute
After a given query has passed through cloud services, it is sent to the
compute layer for execution. The compute layer is composed of all virtual
warehouses a customer has created. Virtual warehouses are an abstraction
over one or more compute instances, or "nodes". For Snowflake accounts
running on Amazon Web Services, a node would be equivalent to a single
EC2 instance. Snowflake uses t-shirt sizing for its warehouses to configure
how many nodes they will have. Customers will typically create separate
warehouses for different workloads. In the image below, we can see a
hypothetical setup with 3 virtual warehouses: a small warehouse used for
business intelligence[5], an extra-small warehouse used for loading data into
Snowflake, and a large warehouse used for data transformations.


If we zoom in on the extra-small warehouse, the smallest size warehouse
offered by Snowflake, we can see that it consists of a single node. Each
node has 8 cores/threads, 16GB of memory (RAM), and an SSD cache[6], with
the exception of 5XL and 6XL which run on different node specifications.
With each warehouse size increase, the number of nodes in the warehouse
will double. This means that the number of threads, memory, and disk
space will also double. A size small warehouse will have twice as much
memory (32GB), twice as many cores (16) and double the amount of disk
space that an extra-small warehouse will have. By extension, a large
warehouse will have 8 times the resources of an extra-small warehouse.

An important aspect of Snowflake's design is that the nodes in each


running warehouse are not used anywhere else. This provides users with a
strong performance guarantee that their queries won't be impacted by
queries running on other warehouses within an account, giving Snowflake
customers the ability to run highly performant and predictable data
workloads. This differs from the cloud services layer which is shared across
accounts, with less consistent timings, though in practice this is
inconsequential.

Storage
Snowflake stores your tables in a scalable cloud storage service (S3 if you
are on AWS, Azure Blob for Azure, etc.)[7]. Every table is partitioned into a
number of immutable micro-partitions. Micro-partitions use
a proprietary, closed-source file format created by Snowflake. Snowflake
aims to keep them around 16MB, heavily compressed[8]. As a result, there
can be millions of micro-partitions for a single table.

Micro-partitions leverage a columnar storage format instead of a row based


layout that is typically used by OLTP databases like Postgres, SQLite, MySQL,
SQL Server, etc. Since analytical queries typically select a few columns across
a wide range of rows, columnar storage formats will achieve significantly
better performance.
Column-level metadata is calculated whenever a micro-partition is created.
The min/max value, count, # of distinct values, # of nulls are some examples
of the metadata that is calculated and stored in cloud services. As discussed
earlier, Snowflake can leverage this metadata during query optimization
and planning to figure out the exact micro-partitions that must be scanned
for a particular query, an important technique known as "pruning". This
can vastly speed up queries by eliminating unnecessary, slow data reads.

Because micro-partitions are immutable, DML operations (updates,


additions, deletes) must add or remove entire files and re-calculate the
required metadata. Snowflake recommends performing DML operations in
batches to reduce the number of micro-partitions which are rewritten,
which will reduce the total run time and cost of the operations.

Since a table object in Snowflake is essentially a cloud services entry that


references a collection of micro-partitions, Snowflake is able to offer
innovative storage features like zero-copy cloning and time travel. When
you create a new table as a clone of an existing table, Snowflake creates a
new metadata entry that points to the same set of micro-partitions. In time
travel, Snowflake tracks which micro-partitions a table was comprised of
over time, allowing users to access the exact version of a table at a
particular point in time.
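
Both features are exposed through plain SQL; a small sketch with hypothetical table names:

-- zero-copy clone: instant, no data is duplicated until either copy is modified
create table orders_dev clone orders;

-- time travel: query the table as it looked one hour ago
select * from orders at(offset => -3600);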

Summary
Snowflake's unique, scalable architecture has allowed it to quickly become
the dominant data warehouse of today. In future posts, we'll dive deeper
into each individual layer in Snowflake's architecture and discuss how you
can take advantage of their features to maximize query
performance and lower costs.

Notes
1
At the time, Snowflake's main competitors were Amazon Redshift and
traditional on-premise offerings like Oracle and Teradata. These existing
solutions all coupled storage and compute on the same machines, making
them difficult and expensive to scale. Today, Snowflake's bigger
competitors are the likes of BigQuery and Databricks. BigQuery likely has a
similar market share, if not greater, due to their seamless integration with
the rest of their Google Cloud Platform. Databricks has become a new
competitor as both companies are beginning to re-position themselves as
"data clouds". ↩

2
With their move to become a full-on "data cloud", Snowflake is rapidly
adding new functionality like Snowpark, Unistore, External Tables, Streamlit
and a native App store - all of which extend Snowflake's architecture. We'll
be ignoring these new capabilities in this architecture review, and focusing
on the data warehousing aspects that most customers use as of today. ↩

3
The queries must be identical in order to be served from the global result
cache. In addition, Snowflake actually has two different caches which can
benefit performance: a global result cache and a local cache in each
warehouse. We'll cover both in more detail in a future post. ↩

4
In addition to serving previously run queries from the global result cache,
Snowflake can also process certain queries
like count(*) or max(column) entirely by leveraging the metadata storage.
Learn more in our post about micro-partitions ↩

5
This warehouse is actually a multi-cluster warehouse, which means
Snowflake will allocate additional compute resources if the query demand
surpasses what a single small warehouse can handle. We'll cover multi-
cluster warehouses in more depth in a future post. ↩
6
These figures are for AWS, and will differ slightly for other cloud providers.
They are not guaranteed to be accurate since Snowflake does not publish
them, and can change the underlying servers and warehouse configurations
at any point. The figures provided here were last validated in August 2022
through two separate sources, and appear to be consistent with what was
observed in 2019. I have not been able to validate the disk space available
on each node, but plan to figure this out experimentally in the coming
months. ↩

7
The exception to this is if you are using external tables to store your
data. ↩

8
This compression is done automatically by Snowflake under the hood.
Uncompressed, these files can be over 500MB!
Calculating cost per query in Snowflake
Date
Wednesday, October 12, 2022

Ian Whitestone

Co-founder & CEO of SELECT

For most Snowflake customers, compute costs (the charges for virtual
warehouses), will make up the largest portion of the bill. To effectively
reduce this spend, high cost-driving queries need to be accurately
identified.

Snowflake customers are billed for each second that virtual warehouses[1]
are running, with a minimum 60 second charge each time one is resumed.
The Snowflake UI currently provides a breakdown of cost per virtual
warehouse, but doesn't attribute spend at a more granular, per-query level.
This post provides a detailed overview and comparison of different ways to
attribute warehouse costs to queries, along with the code required to do
so.
Skip to the final SQL?
If you want to skip ahead and see the SQL implementation for the
recommended approach, you can head straight to the end!

Simple approach
We'll start with a simple approach which multiplies a query's execution time
by the billing rate for the warehouse it ran on. For example, say a query
ran for 10 minutes on a medium size warehouse. A medium warehouse
costs 4 credits per hour, and with a cost of $3 per credit[2], we'd say this
query costs $2 (10/60 hours * 4 credits/hour * $3/credit).

SQL Implementation
We can implement this in SQL by leveraging
the snowflake.account_usage.query_history view which contains all queries from
the last year along with key metadata like the total execution time and size
of the warehouse the query ran on:

WITH

warehouse_sizes AS (

SELECT 'X-Small' AS warehouse_size, 1 AS credits_per_hour UNION ALL

SELECT 'Small' AS warehouse_size, 2 AS credits_per_hour UNION ALL

SELECT 'Medium' AS warehouse_size, 4 AS credits_per_hour UNION ALL

SELECT 'Large' AS warehouse_size, 8 AS credits_per_hour UNION ALL

SELECT 'X-Large' AS warehouse_size, 16 AS credits_per_hour UNION ALL

SELECT '2X-Large' AS warehouse_size, 32 AS credits_per_hour UNION ALL

SELECT '3X-Large' AS warehouse_size, 64 AS credits_per_hour UNION ALL

SELECT '4X-Large' AS warehouse_size, 128 AS credits_per_hour

)
SELECT

qh.query_id,

qh.query_text,

qh.execution_time/(1000*60*60)*wh.credits_per_hour AS query_cost

FROM snowflake.account_usage.query_history AS qh

INNER JOIN warehouse_sizes AS wh

ON qh.warehouse_size=wh.warehouse_size

WHERE

start_time >= CURRENT_DATE - 30

This gives us an estimated query cost for each query_id. To account for the
same query being run multiple times in a period, we can aggregate by
the query_text:

WITH

warehouse_sizes AS (

// same as above

),

queries AS (

SELECT

qh.query_id,

qh.query_text,

qh.execution_time/(1000*60*60)*wh.credits_per_hour AS query_cost

FROM snowflake.account_usage.query_history AS qh
INNER JOIN warehouse_sizes AS wh

ON qh.warehouse_size=wh.warehouse_size

WHERE

start_time >= CURRENT_DATE - 30

)

SELECT

query_text,

SUM(query_cost) AS total_query_cost_last_30d

FROM queries

GROUP BY 1

Opportunities for improvement


While simple and easy to understand, the main pitfall with this approach is
that Snowflake does not charge per second a query ran. They charge per
second the warehouse is up. A given query may automatically resume the
warehouse, run for 6 seconds, then cause the warehouse to idle before
being automatically suspended. Snowflake bills for this idle time, and
therefore it can be helpful to "charge back" this cost to the query. Similarly,
if two queries run concurrently on the warehouse for the same 20 minutes,
Snowflake will bill for 20 minutes, not 40. Idle time and concurrency are
therefore important considerations in cost attribution and optimization
efforts.

When aggregating by query_text to get total cost in the period, we grouped


by the un-processed query text. In practice, it is common for the systems
that created these queries to add unique metadata into each query. For
example, Looker will add some context to each query. The first time a query
is run, it may look like this:

SELECT
id,

created_at

FROM orders

-- Looker Query Context '{"user_id":181,"history_slug":"9dcf35a","instance_slug":"aab1f6"}'

And in the next run, this metadata will be different:

SELECT

id,

created_at

FROM orders

-- Looker Query Context '{"user_id":181,"history_slug":"1kal99e","instance_slug":"jju3q8"}'

Similar to Looker, dbt will add its own metadata, giving each query a
unique invocation_id:

SELECT

id,

created_at

FROM orders

/*{

"app": "dbt",

"invocation_id": "52c47806ae6d",

"node_id": "model.jaffle_shop.orders",

...
}*/

When grouped by query_text, the two occurrences of the query above won't
be linked since this metadata makes each one unique. This could result in a
single and potentially easily addressed source of high cost queries (for
example a dashboard) going unidentified.

We may wish to go even further and bucket costs at a higher level. dbt
models often consist of multiple queries being run: a CREATE TEMPORARY
TABLE followed by a MERGE statement. A given dashboard may trigger 5
different queries each time it is refreshed. Being able to group the entire
collection of queries from a single origin is very useful for attributing spend
and then targeting improvements in a time efficient manner.

With these opportunities in mind, can we do better?

New approach
To be able to reconcile the total attributed query costs with the final bill, it's
important to start with the exact charges for each warehouse. The decision
to use an hourly granularity comes
from snowflake.account_usage.warehouse_metering_history, the source of truth for
warehouse charges, which reports credit consumption at an hourly level.
We can then calculate how many seconds each query spent executing in
the hour, and allocate the credits proportionally to each query based on
their fraction of the total execution time. In doing so, we will account for
idle time by distributing it among the queries that ran during the period.
Concurrency will also be handled since more queries running will generally
lower the average cost per query.

To ground this in an example, say the TRANSFORMING_WAREHOUSE consumed 100


credits in a single hour. During that time, three queries ran, 2 for 10 minutes
and 1 for 20 minutes, for 40 minutes of total execution time. In this
scenario, we would allocate credits to each query in the following way:

1. Query 1 (10 minutes) -> 25 credits


2. Query 2 (20 minutes) -> 50 credits
3. Query 3 (10 minutes) -> 25 credits

In the diagram below, query 3 begins between 17:00-18:00 and finishes


after 18:00. To account for queries which span multiple hours, we only
include the portion of the query that ran in each hour.

When only one query runs in an hour, like Query 5 below, all credit
consumption is attributed to that one query, including the credits
consumed by the warehouse sitting idle.
SQL Implementation
Some queries don't execute on a warehouse and are processed entirely by
the cloud services layer. To filter those, we remove queries
with warehouse_size IS NULL[3]. We'll also calculate a new
timestamp, execution_start_time, to denote the exact time at which the query
began running on the warehouse[4].

SELECT

query_id,

query_text,

warehouse_id,

TIMEADD(

'millisecond',

queued_overload_time + compilation_time +

queued_provisioning_time + queued_repair_time +

list_external_files_time,

start_time

) AS execution_start_time,

end_time

FROM snowflake.account_usage.query_history AS q

WHERE TRUE

AND warehouse_size IS NOT NULL

AND start_time >= CURRENT_DATE - 30


Next, we need to determine how long each query ran in each hour. Say we
have two queries, one that ran within the hour and one that started in one
hour and ended in another.

query_id   execution_start_time       end_time
123        2022-10-08 08:27:51.234    2022-10-08 08:30:20.812
456        2022-10-08 08:30:11.941    2022-10-08 09:01:56.000

We need to generate a table with one row per hour that the query ran
within.

query_id   execution_start_time       end_time                   hour_start
123        2022-10-08 08:27:51.234    2022-10-08 08:30:20.812    2022-10-08 08:00:00.000
456        2022-10-08 08:30:11.941    2022-10-08 09:01:56.000    2022-10-08 08:00:00.000
456        2022-10-08 08:30:11.941    2022-10-08 09:01:56.000    2022-10-08 09:00:00.000


To accomplish this in SQL, we generate a CTE, hours_list, with 1 row per
hour in the 30 day range we are looking at. Then, we perform a range
join with the filtered_queries to get a CTE, query_hours, with 1 row for each
hour that a query executed within.

WITH

filtered_queries AS (

SELECT

query_id,

query_text,

warehouse_id,

TIMEADD(

'millisecond',

queued_overload_time + compilation_time +

queued_provisioning_time + queued_repair_time +

list_external_files_time,

start_time

) AS execution_start_time,

end_time

FROM snowflake.account_usage.query_history AS q

WHERE TRUE

AND warehouse_size IS NOT NULL

AND start_time >= DATEADD('day', -30, DATEADD('day', -1, CURRENT_DATE))


),

hours_list AS (

SELECT

DATEADD(

'hour',

'-' || row_number() over (order by null),

DATEADD('day', '+1', CURRENT_DATE)

) as hour_start,

DATEADD('hour', '+1', hour_start) AS hour_end

FROM TABLE(generator(rowcount => (24*31))) t

),

-- 1 row per hour a query ran

query_hours AS (

SELECT

hl.hour_start,

hl.hour_end,

queries.*

FROM hours_list AS hl

INNER JOIN filtered_queries AS queries

ON hl.hour_start >= DATE_TRUNC('hour', queries.execution_start_time)

AND hl.hour_start < queries.end_time


),

Now we can calculate the number of milliseconds each query ran for within
each hour along with their fraction relative to all queries.

query_seconds_per_hour AS (

SELECT

*,

DATEDIFF('millisecond', GREATEST(execution_start_time, hour_start), LEAST(end_time, hour_end))


AS num_milliseconds_query_ran,

SUM(num_milliseconds_query_ran) OVER (PARTITION BY warehouse_id, hour_start) AS


total_query_milliseconds_in_hour,

num_milliseconds_query_ran/total_query_milliseconds_in_hour AS
fraction_of_total_query_time_in_hour,

hour_start AS hour

FROM query_hours

),

Finally, we get the actual credits used


from snowflake.account_usage.warehouse_metering_history and allocate them to
each query according to the fraction of all execution time that query
contributed. One last aggregation is performed to return the dataset back
to one row per query.

credits_billed_per_hour AS (

SELECT

start_time AS hour,

warehouse_id,

credits_used_compute
FROM snowflake.account_usage.warehouse_metering_history

),

query_cost AS (

SELECT

query.*,

credits.credits_used_compute*2.28 AS actual_warehouse_cost,

credits.credits_used_compute*fraction_of_total_query_time_in_hour*2.28 AS
query_allocated_cost_in_hour

FROM query_seconds_per_hour AS query

INNER JOIN credits_billed_per_hour AS credits

ON query.warehouse_id=credits.warehouse_id

AND query.hour=credits.hour

)

-- Aggregate back to 1 row per query

SELECT

query_id,

ANY_VALUE(MD5(query_text)) AS query_signature,

ANY_VALUE(query_text) AS query_text,

SUM(query_allocated_cost_in_hour) AS query_cost,

ANY_VALUE(warehouse_id) AS warehouse_id,

SUM(num_milliseconds_query_ran) / 1000 AS execution_time_s

FROM query_cost
GROUP BY 1

Processing the query text


As discussed earlier, many queries will contain custom metadata added as
comments, which restrict our ability to group the same queries together.
Comments in SQL can come in two forms:

1. Single line comments starting with --


2. Single or multi-line comments of the form /* <comment text> */

-- This is a valid SQL comment

SELECT

id,

total_price, -- So is this

created_at /* And this! */

FROM orders

/*

This is also a valid SQL comment.

Woo!

*/

Each of these comment types can be removed using


Snowflake's REGEXP_REPLACE function[5].

SELECT

query_text AS original_query_text,
-- First, we remove comments enclosed by /* <comment text> */

REGEXP_REPLACE(query_text, '(/\*.*\*/)') AS _cleaned_query_text,

-- Next, removes single line comments starting with --

-- and either ending with a new line or end of string

REGEXP_REPLACE(_cleaned_query_text, '(--.*$)|(--.*\n)') AS cleaned_query_text

FROM snowflake.account_usage.query_history AS q

Now we can aggregate by cleaned_query_text instead of the


original query_text when identifying the most expensive queries in a
particular timeframe. To see the final version of the SQL implementation
using this cleaned_query_text, head to the appendix.

Opportunities for improvement


While this method is a great improvement over the simple approach, there
are still opportunities to make it better. The credits associated with
warehouse idle time are distributed across all queries that ran in a given
hour. Instead, attributing idle spend only to the query or queries that
directly caused it will improve the accuracy of the model, and therefore its
effectiveness in guiding cost reduction efforts.

This approach also does not take into account the minimum 60-second
billing charge. If there are two queries run separately in a given hour, and
one takes 1 second to execute and another takes 60 seconds, the second
query will appear 60 times more expensive than the first query, even
though the first query also consumes 60 seconds' worth of credits due to the minimum charge.

The query_text processing technique has room for improvement too. It's not
uncommon for incremental data models to have hardcoded dates
generated into the SQL, which change on each run. For example:

-- Query run on 2022-10-03


CREATE TEMPORARY TABLE orders AS (

SELECT

...

FROM orders

WHERE

created_at BETWEEN DATE'2022-10-01' AND DATE'2022-10-02'
)

You can also see this behaviour in parameterized dashboard queries. For
example, a marketing dashboard may expose a templated query:

SELECT

id,

email

FROM customers

WHERE

country_code = {{ selected_country_code }}

AND signup_date >= CURRENT_DATE - {{ signup_days_back }}

Each time this same query is run, it is populated with different values:

SELECT

id,

email

FROM customers
WHERE

country_code = 'CA'

AND signup_date >= CURRENT_DATE - 90

While the parameterized queries can be handled with more advanced SQL
text processing, idle and minimum billing times are trickier. At the end of
the day, the purpose of attributing warehouse costs to queries is to help
users determine where they should focus their time. With this current
approach, we strongly believe it will let you accomplish this goal. All
models are wrong, but some are useful.
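
One pragmatic option, sketched below, is to normalize literals before grouping: replace quoted
strings and bare numbers in the cleaned query text with a placeholder so that parameterized
variants of the same query hash to the same signature. The cleaned_queries CTE is a stand-in for
the processing step shown earlier, and these patterns are deliberately simple, so they will
over-normalize some queries.

SELECT
    cleaned_query_text,
    -- replace single-quoted string literals, then bare numbers, with a placeholder
    REGEXP_REPLACE(
        REGEXP_REPLACE(cleaned_query_text, '''[^'']*''', '?'),
        '[0-9]+',
        '?'
    ) AS normalized_query_text
FROM cleaned_queries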

Planned future enhancements


In addition to the more advanced SQL text processing discussed above,
there are a few other enhancements we plan to make to this approach:

 If cloud service credits exceed 10% of your daily compute credits,


Snowflake will begin to charge you for them. To improve the
robustness of this model, we need to account for the cloud services
credits associated with each query that ran in a warehouse, as well as
the queries that did not run in any warehouse. Simple queries
like SHOW TABLES that only run in cloud services can end up consuming
credits if they are executed very frequently. See this post on how
Metabase metadata queries were costing $500/month in cloud
services credits.
 Extend the model to calculate cost per data asset, rather than cost
per query. To calculate cost per DBT model, this will involve parsing
the dbt JSON metadata automatically injected into each SQL query
generated by dbt. It could also involve connecting to BI tool
metadata to calculate things like "cost per dashboard".
 We plan to bundle this code into a new dbt package so users can
easily get greater visibility into their Snowflake spend

Notes
1
Snowflake uses the concept of credits for most of its billable services.
When warehouses are running, they consume credits. The rate at which
credits are consumed doubles each time the warehouse size is increased.
An X-Small warehouse costs 1 credit per hour, a small costs 2 credits per
hour, a medium costs 4 credits per hour, etc. Each Snowflake customer will
pay a fixed rate per credit, which is how the final dollar value on the
monthly bill is calculated. ↩

2
The cost per credit will vary based on the plan you are on (Standard,
Enterprise, Business Critical, etc..) and your contract. On demand customers
will generally pay $2/credit for Standard, and $3/credit on Enterprise. If you
sign an annual contract with Snowflake, this rate will get discounted based
on how many credits you purchase up front. All examples here are in US
dollars. ↩

3
It is possible for queries to run without a warehouse by leveraging the
metadata in cloud services. ↩

4
There are a number of things that need to happen before a query can
begin executing in a warehouse, such as query compilation in cloud
services and warehouse provisioning. In a future post we'll dive deep into
the lifecycle of a Snowflake query. ↩

5
Note that the regex '(/\*.*\*/)' won't work for two comments on the same line,
such as /* hi */SELECT * FROM table/* hello there */ ↩

Appendix
Complete SQL Query
For a Snowflake account with ~9 million queries per month, the query
below took 93 seconds on an X-Small warehouse.

WITH

filtered_queries AS (

SELECT
query_id,

query_text AS original_query_text,

-- First, we remove comments enclosed by /* <comment text> */

REGEXP_REPLACE(query_text, '(/\*.*\*/)') AS _cleaned_query_text,

-- Next, removes single line comments starting with --

-- and either ending with a new line or end of string

REGEXP_REPLACE(_cleaned_query_text, '(--.*$)|(--.*\n)') AS cleaned_query_text,

warehouse_id,

TIMEADD(

'millisecond',

queued_overload_time + compilation_time +

queued_provisioning_time + queued_repair_time +

list_external_files_time,

start_time

) AS execution_start_time,

end_time

FROM snowflake.account_usage.query_history AS q

WHERE TRUE

AND warehouse_size IS NOT NULL

AND start_time >= DATEADD('day', -30, DATEADD('day', -1, CURRENT_DATE))


),

-- 1 row per hour from 30 days ago until the end of today

hours_list AS (

SELECT

DATEADD(

'hour',

'-' || row_number() over (order by null),

DATEADD('day', '+1', CURRENT_DATE)

) as hour_start,

DATEADD('hour', '+1', hour_start) AS hour_end

FROM TABLE(generator(rowcount => (24*31))) t

),

-- 1 row per hour a query ran

query_hours AS (

SELECT

hl.hour_start,

hl.hour_end,

queries.*

FROM hours_list AS hl

INNER JOIN filtered_queries AS queries

ON hl.hour_start >= DATE_TRUNC('hour', queries.execution_start_time)


AND hl.hour_start < queries.end_time

),

query_seconds_per_hour AS (

SELECT

*,

DATEDIFF('millisecond', GREATEST(execution_start_time, hour_start), LEAST(end_time, hour_end))


AS num_milliseconds_query_ran,

SUM(num_milliseconds_query_ran) OVER (PARTITION BY warehouse_id, hour_start) AS


total_query_milliseconds_in_hour,

num_milliseconds_query_ran/total_query_milliseconds_in_hour AS
fraction_of_total_query_time_in_hour,

hour_start AS hour

FROM query_hours

),

credits_billed_per_hour AS (

SELECT

start_time AS hour,

warehouse_id,

credits_used_compute

FROM snowflake.account_usage.warehouse_metering_history

),

query_cost AS (

SELECT
query.*,

credits.credits_used_compute*2.28 AS actual_warehouse_cost,

credits.credits_used_compute*fraction_of_total_query_time_in_hour*2.28 AS
query_allocated_cost_in_hour

FROM query_seconds_per_hour AS query

INNER JOIN credits_billed_per_hour AS credits

ON query.warehouse_id=credits.warehouse_id

AND query.hour=credits.hour

),

cost_per_query AS (

SELECT

query_id,

ANY_VALUE(MD5(cleaned_query_text)) AS query_signature,

SUM(query_allocated_cost_in_hour) AS query_cost,

ANY_VALUE(original_query_text) AS original_query_text,

ANY_VALUE(warehouse_id) AS warehouse_id,

SUM(num_milliseconds_query_ran) / 1000 AS execution_time_s

FROM query_cost

GROUP BY 1

)

SELECT

query_signature,
COUNT(*) AS num_executions,

AVG(query_cost) AS avg_cost_per_execution,

SUM(query_cost) AS total_cost_last_30d,

ANY_VALUE(original_query_text) AS sample_query_text

FROM cost_per_query

GROUP BY 1

Alternative approach considered


Before landing on the final approach presented above, an approach that
more accurately handled concurrency and idle time was considered,
especially across multi-cluster warehouses. Instead of working from the
actual credits charged per hour, this approach leveraged
the snowflake.account_usage.warehouse_events_history view to construct a dataset
with 1 row per second each warehouse cluster was active. Using this
dataset, along with the knowledge of which query ran on which warehouse
cluster, it's possible to more accurately attribute credits to each set of
queries, as shown in the diagram below.
Unfortunately, it was discovered that the warehouse_events_history does not
give a perfect representation of when each warehouse cluster was active, so
this approach was abandoned.
60x faster database clones in Snowflake
Date
Saturday, October 22, 2022

Niall Woodward

Co-founder & CTO of SELECT

Snowflake's zero-copy cloning feature is extremely powerful for quickly


creating production replica environments. But, anyone who has cloned a
database or schema with a large number of tables has experienced that it
can take over ten minutes to complete. In this post we explore a potential
solution.

Introduction
I had the pleasure of attending dbt’s Coalesce conference in London last
week, and dropped into a really great talk by Felipe Leite and Stephen
Pastan of Miro. They mentioned how they’d achieved a considerable speed
improvement by switching database clones out for multiple table clones. I
had to check it out.

Experiments
Results were collected using the following query:
select

count(*) as query_count,

datediff(seconds, min(start_time), max(end_time)) as duration,

sum(credits_used_cloud_services) as credits_used_cloud_services

from snowflake.account_usage.query_history where query_tag = X;

Setup
Create a database with 10 schemas, 100 tables in each:

import snowflake.connector

con = snowflake.connector.connect(
    ...
)

for i in range(1, 11):
    con.cursor().execute(f"create schema test.schema_{i};")
    for j in range(1, 101):
        con.cursor().execute(f"create table test.schema_{i}.table_{j} (i number) as (select 1);")

Control - Database clone

create database test_1 clone test;

This operation took 22m 34s to execute.

Results:
Query count   Duration   Cloud services credits
1             22m 34s    0.179

Experiment 1 - Schema level clones

import snowflake.connector
from snowflake.connector import DictCursor

def clone_database_by_schema(con, source_database, target_database):
    con.cursor().execute(f"create database {target_database};")
    cursor = con.cursor(DictCursor)
    cursor.execute(f"show schemas in database {source_database};")
    for i in cursor.fetchall():
        if i["name"] not in ("INFORMATION_SCHEMA", "PUBLIC"):
            con.cursor().execute_async(
                f"create schema {target_database}.{i['name']} clone {source_database}.{i['name']};"
            )

con = snowflake.connector.connect(
    ...
    session_parameters={
        'QUERY_TAG': 'test 2',
    },
)

clone_database_by_schema(con, "test", "test_2")

Results:

Query count   Duration   Cloud services credits
12            1m 47s     0.148

Using execute_async executes each SQL statement without waiting for each
to complete, resulting in all 10 schemas being cloned concurrently. A
whopping 10x faster from start to finish compared with the regular
database clone.

Experiment 2 - Table level clones

import snowflake.connector
from snowflake.connector import DictCursor

def clone_database_by_table(con, source_database, target_database):
    con.cursor().execute(f"create database {target_database};")
    cursor = con.cursor(DictCursor)
    cursor.execute(f"show tables in database {source_database};")
    results = cursor.fetchall()
    schemas_to_create = {r['schema_name'] for r in results}
    tables_to_clone = [f"{r['schema_name']}.{r['name']}" for r in results]
    for schema in schemas_to_create:
        con.cursor().execute(f"create schema {target_database}.{schema};")
    for table in tables_to_clone:
        con.cursor().execute_async(f"create table {target_database}.{table} clone {source_database}.{table};")

con = snowflake.connector.connect(
    ...
    session_parameters={
        'QUERY_TAG': 'test 3',
    },
)

clone_database_by_table(con, "test", "test_3")

This took 1 minute 48s to complete, the limiting factor being the rate at
which the queries could be dispatched by the client (likely due to network
waiting times). To help mitigate that, I distributed the commands across 10
threads:

import snowflake.connector
from snowflake.connector import DictCursor
import threading

class ThreadedRunCommands():
    """Helper class for running queries across a configurable number of threads"""

    def __init__(self, con, threads):
        self.threads = threads
        self.register_command_thread = 0
        self.thread_commands = [
            [] for _ in range(self.threads)
        ]
        self.con = con

    def register_command(self, command):
        self.thread_commands[self.register_command_thread].append(command)
        if self.register_command_thread + 1 == self.threads:
            self.register_command_thread = 0
        else:
            self.register_command_thread += 1

    def run_command(self, command):
        self.con.cursor().execute_async(command)

    def run_commands(self, commands):
        for command in commands:
            self.run_command(command)

    def run(self):
        procs = []
        for v in self.thread_commands:
            proc = threading.Thread(target=self.run_commands, args=(v,))
            procs.append(proc)
            proc.start()
        # complete the processes
        for proc in procs:
            proc.join()

def clone_database_by_table(con, source_database, target_database):
    con.cursor().execute(f"create database {target_database};")
    cursor = con.cursor(DictCursor)
    cursor.execute(f"show tables in database {source_database};")
    results = cursor.fetchall()
    schemas_to_create = {r['schema_name'] for r in results}
    tables_to_clone = [f"{r['schema_name']}.{r['name']}" for r in results]
    for schema in schemas_to_create:
        con.cursor().execute(f"create schema {target_database}.{schema};")
    threaded_run_commands = ThreadedRunCommands(con, 10)
    for table in tables_to_clone:
        threaded_run_commands.register_command(
            f"create table {target_database}.{table} clone {source_database}.{table};"
        )
    threaded_run_commands.run()

con = snowflake.connector.connect(
    ...
    session_parameters={
        'QUERY_TAG': 'test 4',
    },
)

clone_database_by_table(con, "test", "test_4")

Results:
Query count   Duration   Cloud services credits
1012          22s        0.165

Using 10 threads, the time between the create database command starting
and the final create table ... clone command completing was only 22
seconds. This is 60x faster than the create database ... clone command.
The bottleneck is still the rate at which queries can be dispatched.

In Summary
The complete results:

Clone strategy                        Query count   End-to-end duration
Control - Database clone              1             22m 34s
Experiment 1 - Schema level clones    12            1m 47s
Experiment 2 - Table level clones     1012          22s

All the queries ran were cloud services only, and did not require a running
warehouse or resume a suspended one.
I hope that Snowflake improves their schema and database clone
functionality, but in the mean time, cloning tables seems to be the way to
go.

Thanks again to Felipe Leite and Stephen Pastan of Miro for sharing this!
3 Ways to Achieve Effective Clustering
in Snowflake
Date
Saturday, November 12, 2022

Niall Woodward

Co-founder & CTO of SELECT

In our previous post on micro-partitions, we dove into how Snowflake's unique storage
format enables a query optimization called pruning. Pairing query design with effective
clustering can dramatically improve pruning and therefore query speeds. We'll explore how
and when you should leverage this powerful Snowflake feature.

What is Snowflake clustering?


Clustering describes the distribution of data across micro-partitions, the unit of storage in
Snowflake, for a particular table. When a table is well-clustered, Snowflake can leverage the
metadata from each micro-partition to minimize the number of files the query must scan,
greatly improving query performance. Due to this behaviour, clustering is one of the most
powerful optimization techniques Snowflake users can use to improve
performance and lower costs.

Let's explore this concept with an example.


Example of a well clustered table
In the diagram below, we have a hypothetical orders table that is well-clustered on
the created_at column, as rows with similar created_at values are located in the same
micro-partitions.

Snowflake maintains minimum and maximum value metadata for each column in each micro-
partition. In this table, each micro-partition contains records for a narrow range
of created_at values, so the table is well-clustered on the column. The following query only
scans the first three micro-partitions highlighted, as Snowflake knows it can ignore the rest
based on the where clause and minimum and maximum value micro-partition metadata. This
behavior is called query pruning.

select *

from orders

where created_at > '2022/08/14'

Unsurprisingly, the impact of scanning only three micro-partitions instead of every micro-
partition is that the query runs considerably faster.

When should you use clustering?


Most Snowflake users don’t need to consider clustering. If your queries run fast enough and
you’re comfortably under budget, then it’s really not worth worrying about. But, if you care
about performance and/or cost, you definitely should care about clustering.

Pruning is arguably the most powerful optimization technique available to Snowflake users,
as reducing the amount of data scanned and processed is such a fundamental principle in big
data processing: “The fastest way to process data? Don’t.”
Snowflake’s documentation suggests that clustering is only beneficial for tables containing
“multiple terabytes (TB) of data”. In our experience, however, clustering can have
performance benefits for tables starting at hundreds of megabytes (MB).

Choosing a clustering key


To know whether a table is well-clustered for typical queries against it, you first have to
know what those query patterns are. Snowflake's access_history view provides an easy way
of retrieving historic queries for a particular table.
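As a rough sketch of what that lookup can look like (the fully qualified table name and the 30 day window below are placeholder assumptions to adjust for your account):

select
    ah.query_id,
    ah.query_start_time,
    ah.user_name
from snowflake.account_usage.access_history as ah,
    lateral flatten(input => ah.base_objects_accessed) as obj
where obj.value:"objectName"::string = 'MY_DB.MY_SCHEMA.ORDERS' -- placeholder table name
    and ah.query_start_time > dateadd('day', -30, current_timestamp())
order by ah.query_start_time desc;

Joining the results to the query_history view on query_id gives you the query text, making it easy to spot which columns appear most often in where clauses.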

Frequently used where clause filtering keys are good choices for clustering keys. For
example:

select *

from table_a

where created_at > '2022-09-25'

The above query will benefit from a table that is well-clustered on the created_at column,
as similar values would be contained within the same micro-partition, resulting in only a
small number of micro-partitions being scanned. This pruning determination is performed by
the query compiler in the cloud services layer, prior to execution taking place.

In practice, we recommend starting by exploring the costliest queries in your account, which will likely highlight queries that prune micro-partitions ineffectively despite using filtering. These present opportunities to improve table clustering.

How to enable clustering in Snowflake?


Once you know what columns you want to cluster on, you'll need to choose a clustering
method. We like to categorize the options into three.

1. Natural clustering
Suppose there is an ETL process adding new events to an events table each hour. A
column inserted_at represents the time at which events are loaded into the table. Newly
created micro-partitions will each have a tightly bound range of inserted_at values. This
events table would be described as naturally clustered on the inserted_at column. A
query that filters this table on the inserted_at column will prune micro-partitions
effectively.

When performing a backfill of a table that you'd like to leverage natural, insertion-order
clustering on, make sure to sort the data by the natural clustering key first. That way the
historic records are well-clustered, as well as the new ones that get inserted.
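For example, a backfill might look something like the sketch below, where raw_events_history is a hypothetical stand-in for wherever the historical data lives:

insert into events
select *
from raw_events_history -- hypothetical backfill source
order by inserted_at;   -- sort by the natural clustering key before loading

Because Snowflake writes micro-partitions roughly in the order the rows arrive, sorting the backfill this way keeps the historical micro-partitions tightly bound on inserted_at.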
Pros
 No additional expenditure or effort required

Cons
 Only works for queries that filter on a column that correlates to the order in which
data is inserted

2. Automatic clustering service


The automatic clustering service and option 3, manual sorting, involve sorting a table's
data by a particular key. The sorting operation requires computation, which can either be
performed by Snowflake with the automatic clustering service, or manually. The diagram
below uses a date column to illustrate, but a table can be re-clustered by any
expression/column.

The automatic clustering service uses Snowflake-managed compute resources to perform the
re-clustering operation. This service only runs if a 'clustering key' has been set for a table:

-- you can cluster by one or more comma separated columns

alter table my_table cluster by (column_to_cluster_by);

-- or you can cluster by an expression

alter table my_table cluster by (substring(column_to_cluster_by, 5, 15));

The automatic clustering service performs work in the background to create and destroy
micro-partitions so they contain tightly bound ranges of records based on the specified
clustering key. This service is charged based on how much work Snowflake performs, which
depends on the clustering key, the size of the table and how frequently its contents are
modified. Consequently, tables that are frequently modified (inserts, updates, deletes) will
incur higher automatic clustering costs. It's worth noting that the automatic clustering service
only uses the first 5 bytes of a column when performing re-clustering. This means that
column values with the same first few characters won't cause the service to perform any re-
clustering.

The automatic clustering service is simple to use, but easy to spend money with. If you
choose to use it, make sure to monitor both the cost and impact on queries on the table to
determine if it achieves a good price/performance ratio. If you're interested in learning more
about the automatic clustering service, check out this detailed post on the inner workings by
one of Snowflake's engineers.
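If costs do spike, the service can be paused and resumed at any time. A minimal sketch, with my_table as a placeholder:

-- temporarily stop the automatic clustering service for a table
alter table my_table suspend recluster;

-- turn it back on once you're ready
alter table my_table resume recluster;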

Pros
 The lowest effort way to cluster on a different key to the natural key.
 Doesn't block or interfere with DML operations.

Cons
 Unpredictable costs.
 Snowflake takes a higher margin on automatic clustering than warehouse compute
costs, which can make automatic clustering less cost-effective than manual re-sorting.

3. Manual sorting
With fully recreated tables
If a table is always fully recreated as part of a transformation/modeling process, the table can
be perfectly clustered on any key by adding an order by statement to the create table as
(CTAS) query:

create or replace table my_table as (
    with transformations as (
        ...
    )
    select *
    from transformations
    order by my_cluster_key
);

In this scenario of a table that is always fully recreated, we recommend manual sorting over the automatic clustering service, as it produces the same well-clustered result at a much lower cost.

On existing tables
Manually re-sorting an existing table on a particular key simply replaces the table with a
sorted version of itself. Let’s suppose we have a sales table with entries for lots of different
stores, and most queries on the table always filter for a specific store. We can perform the
following query to ensure that the table is well-clustered on the store_id:

create or replace table sales as (
    select * from sales order by store_id
);

As new sales are added to the table over time, the existing micro-partitions will remain well-
clustered by store_id, but new micro-partitions will contain records for lots of different
stores. That means that older micro-partitions will prune well, but new micro-partitions won't.
Once performance decreases below acceptable levels, the manual re-sorting query can be run
again to ensure that all the micro-partitions are well-clustered on store_id.

The benefit of manual re-sorting over the automatic clustering service is complete control
over how frequently the table is re-clustered, and the associated spend. However, the danger
of this approach is that any DML operations which occur on the table while the create or
replace table operation is running will be undone. Manual re-sorting should only be used
on tables with predictable or pausable DML patterns, where you can be sure that no DML
operations will run while the re-sort is taking place.

Pros
 Provides complete control over the clustering process.
 Lowest cost way to achieve perfect clustering on any key.

Cons
 Higher effort than the automatic clustering service. Requires the user to either
manually execute the sorting query or implement automated orchestration of the
sorting query.
 Replacing an existing table with a sorted version of itself reverses any DML
operations which run during the re-sort.
Which clustering strategy should you use and when?
Always aim to leverage natural clustering, as by definition it requires no re-clustering of the table. Transformation processes that use incremental data processing to only handle new or updated data should add an inserted_at or updated_at column for this reason, as these columns will be naturally clustered and produce efficient pruning.

It’s common to see that most queries for an organization filter by the same columns, such
as region or store_id. If queries with common filtering patterns are causing full table scans,
then depending on how the table is populated, consider using automatic clustering or manual
re-sorting to cluster on the filtered column. If you’re not sure how you’d implement manual
re-sorting or there's a risk of DML operations running during the re-sort, use the automatic
clustering service.

Other good candidates for re-clustering are tables queried on a timestamp column which
doesn't always correlate to when the data was inserted, so natural clustering can't be used. An
example of this is an events table which is frequently queried on event_created_at or
similar, but events can arrive late and so micro-partitions have time range overlap. Re-clustering the table on event_created_at will ensure those queries prune well.

Regardless of the clustering approach chosen, it’s always a good idea to sort data by the
desired clustering key before inserting into the table.

Closing
Ultimately, pruning is achieved with complementary query design and table clustering. The
more data, the more powerful pruning is, with the potential to improve a query's performance
by orders of magnitude.

We’ll go deeper on the topic of clustering in future posts, including the use of
Snowflake’s system$clustering_information function to analyze clustering statistics.
We'll also explore options for when a table needs to be well-clustered on more than one
column, so be sure to subscribe to our mailing list below. Thanks for reading, and please get
in touch via Twitter or email where we'd be happy to answer questions or discuss these
topics in more detail.
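As a quick preview, here's a minimal sketch of that function for a hypothetical orders table and a candidate clustering key:

-- returns a JSON document with clustering statistics such as
-- total_partition_count, average_overlaps and average_depth
select system$clustering_information('orders', '(o_orderdate)');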

Niall Woodward
Defining multiple cluster keys in Snowflake
with materialized views
Date
Sunday, November 20, 2022

Ian Whitestone

Co-founder & CEO of SELECT

In our previous post on Snowflake clustering, we discussed the importance of understanding table usage patterns when deciding how to
cluster a table. If a particular field is used frequently in where clauses, that
can make a great candidate as a cluster key. But what if there are other
frequently used where predicates that could benefit from clustering?

In this post we compare three options:

1. A single table with multi-column cluster keys


2. Maintaining separate tables clustered by each column
3. Using clustered, materialized views to leverage Snowflake's powerful
automatic pruning optimization feature

The limitations of multi-column cluster keys


When defining a cluster key for a single table, Snowflake allows you to use
more than one column. Let’s say we have an orders table with 1.5 billion
records:

-- 1,500,000,000 records

create table orders as (
    select
        o_orderdate,
        o_orderkey,
        o_custkey,
        o_clerk
    from snowflake_sample_data.tpch_sf1000.orders
);

A common scenario goes like this. The finance team regularly query specific
date ranges on this table in order to understand our sales volume. We also
have our engineering teams querying this table to investigate specific
orders. On top of that, marketing wants the ability to see all historical
orders for a given customer.

That’s three different access patterns, and as a result, three different columns we’d want to cluster our table
by: o_orderdate, o_custkey and o_orderkey. As shown in Snowflake’s
documentation, we can define a multi-column cluster key for our table
using all three columns in the cluster by expression:¹

create table orders cluster by (o_orderdate, o_custkey, o_orderkey) as (
    select
        o_orderdate, -- 2,406 distinct values
        o_orderkey, -- 1,500,000,000 distinct values
        o_custkey, -- 99,999,998 distinct values
        o_clerk
    from snowflake_sample_data.tpch_sf1000.orders
);

Access Pattern 1: Query by date

select

o_orderdate,

count(*) as cnt

from orders

where o_orderdate between '1993-03-01' and '1993-03-31'

group by 1

Running a query against a range of dates, we can see from the query profile
that we are getting excellent query pruning. Only 22 out of 1609 micro-
partitions are being scanned.
Access Pattern 2: Query for a specific customer

select *

from orders

where o_custkey = 52671775

When we change our query to look up all orders for a particular customer,
the query pruning is ineffective with 99% of all micro-partitions being
scanned.
Access Pattern 3: Query for a specific order

select *

from orders

where o_orderkey = 5019980134

For the order key lookup on the third column of our cluster key, we see no pruning whatsoever: every micro-partition is scanned to find our 1 record.
Understanding the degraded performance of multi-column
cluster keys
As demonstrated above, the query pruning performance degrades
significantly for predicates (filters) on the second and third columns.

To understand why this is the case, it’s important to understand how Snowflake’s clustering works for multi-column cluster keys. The simplest
mental model for this is thinking about how you’d organize the data in
“boxes of boxes”. Snowflake first groups the data by o_orderdate. Next,
within each “date” box, it divides the data by o_custkey. In each of those
boxes, it then divides the data by o_orderkey.
Snowflake’s query pruning works by checking the min/max metadata for
the column in each micro-partition. When we query by date, each date has
its own dedicated box so we can quickly discard (prune) irrelevant boxes.
When we query by customer or order key, we have to check each top-level
date box because the min/max value for these columns is a very wide range
(a wide range of customers place orders on each day and the order keys are
random IDs, not ascending with the order date), so it's not possible to rule
out any boxes.

Creating multiple copies of the same table with different cluster keys
As an alternative approach, we could create and maintain a separate table
for each cluster key:

create table orders_clustered_by_date cluster by (o_orderdate) as (
    select
        o_orderdate,
        o_orderkey,
        o_custkey,
        o_clerk
    from snowflake_sample_data.tpch_sf1000.orders
);

create table orders_clustered_by_customer cluster by (o_custkey) as (
    select
        o_orderdate,
        o_orderkey,
        o_custkey,
        o_clerk
    from snowflake_sample_data.tpch_sf1000.orders
);

create table orders_clustered_by_order cluster by (o_orderkey) as (
    select
        o_orderdate,
        o_orderkey,
        o_custkey,
        o_clerk
    from snowflake_sample_data.tpch_sf1000.orders
);

This approach has clear downsides. Users now have to keep track of three different tables and remember which one to use for each query scenario, which isn't practical for a widely used table. You would also be responsible for maintaining the three separate copies of this table in your ETL/ELT pipelines.

Perhaps there’s a better way?

Using clustered, materialized views to leverage Snowflake's powerful automatic pruning optimization feature
What are materialized views?
A materialized view is a pre-computed data set derived from a query specification and stored for later use.² We’ll discuss their use cases in a future post, but for now you can read the Snowflake docs, which cover them in great detail. When you create a materialized view, like the one below, Snowflake automatically maintains this derived dataset on your behalf. When data is added or modified in the base table (orders), Snowflake automatically updates the materialized view.

create materialized view orders_aggregated_by_date as (
    select
        o_orderdate,
        count(*) as cnt
    from orders
    group by 1
);

Now, if anyone ever runs this query against the base table:

select
o_orderdate,

count(*) as cnt

from orders

group by 1

Snowflake will automatically scan the pre-computed materialized view instead of re-computing the entire dataset.

Creating automatically clustered materialized views


Materialized views support automatic clustering. Using this, we can create
two new materialized views that separately cluster our orders table
by o_custkey and o_orderkey for optimal performance:

-- these will take some time to execute, since the entire dataset is

-- being materialized (created) for the first time

create materialized view orders_clustered_by_customer cluster by (o_custkey) as (
    select
        o_orderdate,
        o_orderkey,
        o_custkey,
        o_clerk
    from orders
);

create materialized view orders_clustered_by_order cluster by (o_orderkey) as (
    select
        o_orderdate,
        o_orderkey,
        o_custkey,
        o_clerk
    from orders
);

Technically, we could create a third materialized view that is clustered by o_orderdate. Instead, we’ll take the more cost effective approach of
leveraging manual sorting on our base orders table:

create table orders as (
    select
        o_orderdate,
        o_orderkey,
        o_custkey,
        o_clerk
    from snowflake_sample_data.tpch_sf1000.orders
    -- sort and therefore cluster the table by o_orderdate
    order by o_orderdate
);
Re-testing our three access patterns
Access Pattern 1: Query by date

select

o_orderdate,

count(*) as cnt

from orders

where

o_orderdate between date'1993-03-01' and date'1993-03-31'

group by 1

When running a query with a filter on o_orderdate, our original base orders table is used, since it is naturally clustered by this column.
Access Pattern 2: Query by customer

select *

from orders

where

o_custkey=52671775

When we instead filter by o_custkey, the Snowflake optimizer recognizes that there is a materialized view clustered by this column, and intelligently instructs the query execution plan to read from the materialized view.

Look closely!
Note that we don’t have to re-write our query to explicitly tell Snowflake to
query the materialized view, it does this under the hood. Users don’t have
to remember which dataset to query under different scenarios!

Access Pattern 3: Query by order

select *

from orders

where

o_orderkey = 5019980134

Filtering on o_orderkey has similar behaviour, with Snowflake “re-routing” the query execution to scan our other materialized view instead of the base orders table.
Clustered materialized view cost considerations
The main downside of using materialized views is the added costs of
maintaining the separate materialized views. There are three components
to consider:

1. Storage costs associated with the new datasets


2. Charges for the managed refreshes of each materialized view. In
order to prevent materialized views from becoming out-of-date,
Snowflake performs automatic background maintenance of
materialized views. When a base table changes, all materialized views
defined on the table are updated by a background service that uses
compute resources provided by Snowflake.
3. Charges for automatic clustering on each materialized view. If a
materialized view is clustered differently from the base table, the
number of micro-partitions changed in the materialized view might
be substantially larger than the number of micro-partitions changed
in the base table.

We’ll provide more guidance on this in a future post, but for now we recommend monitoring the maintenance costs³ and automatic clustering costs⁴ associated with your materialized views. You can estimate your storage costs upfront based on the table size and your storage rate.⁵

Always consider the costs


It’s very important that Snowflake users consider these added costs. It’s
possible that they are completely offset by the faster downstream queries,
and consequently lower compute costs. It’s also possible that the cost is
fully justified by enabling much faster queries. But, it’s impossible to make
that decision without first calculating the true costs.

Materialized view on a clustered table


Each update to your base table triggers a refresh of all associated
materialized views. So what happens if both your base table and the
materialized view are clustered on different columns?

1. New data gets added to the base table


2. A refresh of the materialized view is triggered
3. Snowflake’s automatic clustering service updates the base table to
improve its clustering
4. Automatic clustering also may kick in for the materialized view
updated in step 2
5. Once step 3 is completed, that may re-trigger steps 2 and 4 for the
materialized view
Be careful with clustering materialized views!
Be very careful when adding a materialized view on top of an automatically
clustered table, since it will significantly increase the maintenance costs of
that materialized view.

Materialized views and DML operations


It’s important to note that you’ll only see the performance benefits from
materialized views on select style queries. DML operations
like updates and deletes won’t benefit. For example, if you run:

update orders

set o_clerk='new clerk'

where o_orderkey=5019980134

The query will do a full table scan on the base orders table and not use the
materialized view.

Summary
In this post we showed how materialized views can be leveraged to create
multiple versions of a table with different cluster keys. This practice can
help significantly improve query performance due to better pruning and
even lower the virtual warehouse costs associated with those queries. As
with anything in Snowflake, these benefits must be carefully considered
against their underlying costs.

In future posts, we’ll explore important topics like how to determine the
optimal cluster keys for your table, estimating the costs of automatic
clustering for a large table, and how to monitor clustering health and
implement more cost effective automatic clustering. We’ll also dive deeper
into defining multiple cluster keys on a single table and when it makes
sense to do so.

As always, don’t hesitate to reach out via Twitter or email where we'd be
happy to answer questions or discuss these topics in more detail. If you
want to get notified when we release a new post, be sure to sign up for our
Snowflake newsletter at the bottom of this page.

Notes
1
Notice how we order the clustering keys from lowest to highest
cardinality? From the Snowflake documentation on multi-column
cluster keys:

If you are defining a multi-column clustering key for a table, the order in
which the columns are specified in the CLUSTER BY clause is important. As a
general rule, Snowflake recommends ordering the columns
from lowest cardinality to highest cardinality. Putting a higher cardinality
column before a lower cardinality column will generally reduce the
effectiveness of clustering on the latter column.

A column’s cardinality is simply the number of distinct values. You can find
this out by running a query:

select

count(*), -- 1,500,000,000

count(distinct o_orderdate), -- 2,406

count(distinct o_orderkey), -- 1,500,000,000

count(distinct o_custkey) -- 99,999,998

from public.orders

So as a result, we cluster by (o_orderdate, o_custkey, o_orderkey). ↩

2
You can only use materialized views if you are on the enterprise (or above)
edition of Snowflake. ↩

3
You can monitor the cost of your materialized view refreshes using the
following query:

select

date_trunc(day, start_time) as date,

table_name as materialized_view_name,
sum(credits_used) as num_credits_used

from snowflake.account_usage.materialized_view_refresh_history

group by 1,2

order by 1,2

4
You can monitor the cost of automatic clustering on your materialized
view using the following query:

select

date_trunc(day, automatic_clustering_history.start_time) as date,

automatic_clustering_history.database_name || '.' || automatic_clustering_history.schema_name || '.' || automatic_clustering_history.table_name as materialized_view_name,

sum(credits_used) as num_credits_used

from snowflake.account_usage.automatic_clustering_history

inner join snowflake.account_usage.tables

on automatic_clustering_history.table_id=tables.table_id

and tables.table_type='MATERIALIZED VIEW'

group by 1,2

order by 1,2

5
Most customers on AWS pay $23/TB/month. So if your base table is 10TB, then each additional materialized view will cost $2,760/year (10 * 23 * 12). ↩
Choosing the right warehouse size
in Snowflake
Date
Sunday, November 27, 2022

Niall Woodward

Co-founder & CTO of SELECT

Snowflake users enjoy a lot of flexibility when it comes to compute configuration. In this post we cover the implications of virtual warehouse
sizing on query speeds, and share some techniques to determine the right
one.

The days of complex and slow cluster resizing are behind us; Snowflake
makes it possible to spin up a new virtual warehouse or resize an existing
one in a matter of seconds. The implications of this are:

1. Significantly reduced compute idling (auto-suspend and scale-in for multi-cluster warehouses)
2. Better matching of compute power to workloads (ease of
provisioning, de-provisioning and modifying warehouses)

Being able to easily allocate workloads to different warehouse configurations means faster query run times, improved spend efficiency,
and a better user experience for data teams and their stakeholders. That
leads to the question:
Which warehouse size should I use?

Before we look to answer that question, let's first understand what a virtual
warehouse is, and the impact of size on its available resources and query
processing speed.

What is a virtual warehouse in Snowflake?


Snowflake constructs warehouses from compute nodes. The X-Small uses a
single compute node, a small warehouse uses two nodes, a medium uses
four nodes, and so on. Each node has 8 cores/threads, regardless of cloud
provider. The specifications aren’t published by Snowflake, but it’s fairly well
known that on AWS each node (except for 5XL and 6XL warehouses) is a
c5d.2xlarge EC2 instance, with 16GB of RAM and a 200GB SSD. The
specifications for different cloud providers vary and have been chosen to
provide performance parity across clouds.

While the nodes in each warehouse are physically separated, they operate
in harmony, and Snowflake can utilize all the nodes for a single query.
Consequently, we can work on the basis that each warehouse size increase
doubles the available compute cores, RAM, and disk space available.
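As a rough illustration using the approximate AWS figures above:

X-Small: 1 node = 8 cores, 16 GB RAM, 200 GB local SSD
Medium: 4 nodes = 32 cores, 64 GB RAM, 800 GB local SSD
X-Large: 16 nodes = 128 cores, 256 GB RAM, 3,200 GB local SSD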

What virtual warehouse sizes are available?


Snowflake uses t-shirt sizing names for their warehouses, but unlike t-shirts,
each step up indicates a doubling of resources and credit consumption.
Sizes range from X-Small to 6X-Large. Most Snowflake users will only ever
use the smallest warehouse, the X-Small, as it’s powerful enough for most
datasets up to tens of gigabytes, depending on the complexity of the
workloads.

Warehouse Size   Credits per hour
X-Small          1
Small            2
Medium           4
Large            8
X-Large          16
2X-Large         32
3X-Large         64
4X-Large         128
5X-Large         256
6X-Large         512

The impact of warehouse size on Snowflake query speeds
1. Processing power
Snowflake uses parallel processing to execute a query across multiple cores
wherever it is faster to do so. More cores means more processing power,
which is why queries often run faster on larger warehouses.

There’s an overhead to distributing a query across multiple cores and then combining the result set at the end though, which means that for a certain
size of data, it can be slower to run a query across more cores. When that
happens, Snowflake won’t distribute a query across any more cores, and
increasing a warehouse size won’t yield speed improvements.

Unfortunately, Snowflake doesn’t provide data on compute core utilization.


The only factor that can be used is the number of micro-partitions scanned
by a query. Each micro-partition can be retrieved by an individual core. If a
query scans a smaller number of micro-partitions than there are cores in
the warehouse, the warehouse will be under-utilized for the table scan step.
Snowflake solutions architects often recommend choosing a warehouse
size such that for each core there are roughly four micro-partitions
scanned. The number of scanned micro-partitions can be seen in the query
profile, or query_history view and table functions.

2. RAM and local storage


Data processing requires a place to store intermediate data sets. Beyond
the CPU caches, RAM is the fastest place to store and retrieve data. Once
it’s used up, Snowflake will start using SSD local storage to persist data
between query execution steps. This behavior is called ‘spillage to local
storage’ - local storage is still fast to access, but not as fast as RAM. If
Snowflake runs out of local storage, then remote storage will be used.
Remote storage means the object store for the cloud provider, so S3 for
AWS. Remote storage is considerably slower to access than local storage,
but it is infinite, which means Snowflake will never abort a query due to out
of memory errors. Spillage to remote storage is the clearest indicator that a
warehouse is undersized, and increasing the warehouse size may improve
the query’s speed by more than double. Both local and remote spillage
volumes can again be seen in the query profile, or query_history view and
table functions.
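Here's a minimal sketch of pulling spillage volumes (along with the partition scan counts mentioned earlier) from the account usage query_history view; the 7 day window and 100 row limit are arbitrary choices:

select
    query_id,
    warehouse_size,
    partitions_scanned,
    partitions_total,
    bytes_spilled_to_local_storage,
    bytes_spilled_to_remote_storage
from snowflake.account_usage.query_history
where start_time > dateadd('day', -7, current_timestamp())
order by bytes_spilled_to_remote_storage desc
limit 100;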

Cost vs performance
CPU-bound queries will double in speed as the warehouse size increases,
up until the point at which they no longer fully utilize the warehouse’s
resources. Ignoring warehouse idle times from auto-suspend thresholds, a
query which runs twice as fast on a medium than a small warehouse will
cost the same amount to run, as cost = duration x credit usage rate . The
below graph illustrates this behavior, showing that at a certain point, the
execution time for bigger warehouses remains the same while the cost
increases. So, how do we find that sweet spot of maximum performance for
the lowest cost?
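To make that concrete with a hypothetical example using the credit rates from the table above: a query that runs for 120 seconds on a Small warehouse consumes roughly 120/3600 x 2 = 0.067 credits. If it is CPU-bound and finishes in 60 seconds on a Medium, it consumes 60/3600 x 4 = 0.067 credits: the same cost, in half the time. If the Medium only brings it down to 90 seconds, the cost rises to 90/3600 x 4 = 0.1 credits for a 25% speed-up.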
Determining the best warehouse size for a
Snowflake query
Here’s the process we recommend:

1. Always start with an X-Small.


2. Increase the warehouse size until the query duration stops halving.
When this happens, the warehouse is no longer fully utilized.
3. For the best cost-to-performance ratio, choose a warehouse size one
smaller. For example, if going from a Medium to a Large warehouse
only decreases the query time by 25%, then use the medium
warehouse. If faster performance is needed, use a larger warehouse,
but beware that returns are diminishing at this point.

A warehouse can run more than one query at a time, so where possible
keep warehouses fully loaded and even with light queueing for maximum
efficiency. Warehouses for non-user queries such as transformation
pipelines can often be run at greater efficiency due to the tolerance for
queueing.

Heuristics to identify incorrectly sized warehouses


Experimentation is the only way to exactly determine the best warehouse
size, but here are a few indicators we’ve learned from experience:

Is there remote disk spillage?


A query with significant remote disk spillage typically will at least double in
speed when the warehouse size increases. Remote disk spillage is very
time-consuming, and removing it by providing more RAM and local storage
for the query will give a big speed boost while saving money if the query
runs in less than half the time.

Is there local disk spillage?


Local disk spillage is nowhere near as bad as remote disk spillage, but it still
slows queries down. Increasing warehouse size will speed up the query, but
it’s less likely to double its speed unless the query is also CPU bound. It’s
worth a try though!

Does the query run in less than 10 seconds?


The query is likely not fully utilizing the warehouse’s resources, which
means it can be run on a smaller warehouse more cost-effectively.

Example views for identifying oversized warehouses
Here are two helpful views we leverage in the SELECT product to help
customers with warehouse sizing:

1. The number of queries by execution time. Here you can see that over
98% of the queries running on this warehouse are taking less than 1
second to execute.
2. The number of queries by utilizable warehouse size. Utilizable
warehouse size represents the size of warehouse a query can fully
utilize. Where lots of queries don't utilize the warehouse's size, it
indicates that the warehouse is oversized or the queries should run
on a smaller warehouse. In this example, over 96% of queries being
run on the warehouse aren’t using all 8 nodes available in the Large
warehouse.
Using partitions scanned as a heuristic
Another helpful heuristic is to look at how many micro-partitions a query
is scanning, and then choose the warehouse size based off that. This
strategy comes from Scott Redding, a resident solutions architect at
Snowflake.

The intuition behind this strategy is that the number of threads available for
processing doubles with each warehouse size increase, and each thread can
process a single micro-partition at a time. You want to ensure that each
thread has plenty of work available (files to process) throughout the query
execution.

To interpret this chart, the goal is to aim for roughly 250 micro-partitions per thread. If your query needs to scan 2000 micro-partitions, then running the query on an X-Small will give each thread 250 micro-partitions (files) to process, which is ideal. Compare this with running the query on a 3XL warehouse, which has 512 threads. Each of those threads will only get around 4 micro-partitions to process, which will likely result in many threads sitting unused.

The main pitfall with this approach is that while micro-partitions scanned is
a significant factor in the query execution, other factors like query
complexity, exploding joins, and volume of data sorted will also impact the
required processing power.

Closing
Snowflake makes it easy to match workloads to warehouse configurations,
and we’ve seen queries more than double in speed while costing less
money by choosing the correct warehouse size. Increasing warehouse size
isn't the only option available to make a query run faster though, and many
queries can be made to run more efficiently by identifying and resolving
their bottlenecks. We'll provide a detailed guide on query optimization in a
future post, but if you haven't yet, check out our previous post on
clustering.

How to use the Snowflake Query Profile


Date
Monday, December 5, 2022

Ian Whitestone

Co-founder & CEO of SELECT

The Snowflake Query Profile is the single best resource you have to
understand how Snowflake is executing your query and learn how to
improve it. In this post we cover important topics like how to interpret the
Query Profile and the things you should look for when diagnosing
poor query performance.

What is a Snowflake query plan?


Before we can talk about the Query Profile, it's important to understand
what a "query plan" is. For every SQL query in Snowflake, there is a
corresponding query plan produced by the query optimizer. This plan
contains the set of instructions or "steps" required to process any SQL
statement. It's like a data recipe. Because Snowflake automatically figures
out the optimal way to execute a query, a query plan can look different
than the logical order of the associated SQL statement.

In Snowflake, the query plan is a DAG that consists of operators connected by links. Operators process a set of rows. Example operations include
scanning a table, filtering rows, joining data, aggregating, etc. Links pass
data between operators. To ground this in an example, consider the
following query:

select
date_trunc('day', event_timestamp) as date,

count(*) as num_events

from events

group by 1

order by 1

The corresponding query plan would look something like this:

In this plan there are 4 "operators" and 3 "links":

1. TableScan: reads the records from the events table in remote storage. It passes 1.3 million records through a link to the next operator.¹
2. Aggregate: performs our group by date and count operations and passes 365 records through a link to the next operator.
3. Sort: orders the data by date and passes the same 365 records to the
final operator.
4. Result: returns the query results.

The Query Profile will often refer to operators as "operator nodes" or "nodes" for short. It's also common to refer to them as "stages".

What is the Snowflake Query Profile?


Query Profile is a feature in the Snowflake UI that gives you detailed
insights into the execution of a query. It contains a visual representation of
the query plan, with all nodes and links represented. Execution details and
statistics are provided for each node as well as the overall query.
When should I use it?
The query profile should be used whenever more diagnostic information
about a query is needed. A common example is to understand why a query
is performing a certain way. The Query Profile can help reveal stages of the
query that take significantly longer to process than others. Similarly, you
can use the Query Profile to figure out why a query is still running and
where it is getting stuck.
Another useful application of the Query Profile is to figure out why a query
didn't return the desired result. By carefully studying the links between
nodes, you can potentially identify parts of your query that are resulting in
dropped rows or duplicates, which may explain your unexpected results.

How do you view a Snowflake Query Profile?


After running a query in the Snowsight query editor, the results pane will
include a link to the Query Profile:

Alternatively, you can navigate to the "Query History" page under the
"Activity" tab. For any query run in the last 14 days, you can click on it and
see the Query Profile.
If you already have the query_id handy, you can take advantage of Snowflake's structured URLs by filling out this URL template:

 Template: https://app.snowflake.com/<snowflake-region>/<account-
locator>/compute/history/queries/<paste-query-id-here>/profile
 Filled out
example: https://app.snowflake.com/us-east4.gcp/xq35282/compute/history/
queries/01a8c0a5-0000-0b5e-0000-2dd500044a26/profile

Can you programmatically access data in the Snowflake Query Profile?
Not yet. Snowflake is actively working on a new feature to make it possible
for users to query the data shown in the Query Profile. Stay tuned.

How do you read a Snowflake Query Profile?


Basic Query
We'll start with a simple query anyone can run against the Snowflake
sample dataset:
select

date_trunc('month', o_orderdate) as order_month,

count(*) as num_orders,

sum(o_totalprice) as total_order_value

from snowflake_sample_data.tpch_sf1000.orders

where

year(o_orderdate)=1997

group by order_month

order by order_month

As a first step, it's important to build up a mental model of how each stage/operator in the Query Profile matches up to the query you wrote.
This is tricky to do at first, but very quick once you get the hang of it.
Clicking on each node will reveal more details about the operator, like the
table it is scanning or the aggregations it is performing, which can help
identify the relevant SQL. The SQL relevant to each operator is highlighted
in the image below:

The Query Profile also contains some useful statistics. To highlight a few:
1. An execution time summary. This shows what % of the total query
execution time was spent on different buckets. The 4 options listed
here include:
1. Processing: time spent on query processing operations like
joins, aggregations, filters, sorts, etc.
2. Local Disk I/O: time spent reading/writing data from/to local
SSD storage. This would include things like spilling to disk, or
reading cached data from local SSD.
3. Remote Disk I/O: time spent reading/writing data from/to
remote storage (i.e. S3 or Azure Blob storage). This would
include things like spilling to remote disk, or reading your
datasets.
4. Initialization: this is an overhead cost to start your query on the
warehouse. In our experience, it is always extremely small and
relatively constant.
2. Query statistics. Information like the number of partitions scanned
out of all possible partitions can be found here. Note that this is
across all tables in the query. Fewer partitions scanned means
the query is pruning well. If your warehouse doesn't have enough
memory to process your query and is spilling to disk, this information
will be reflected here.
3. Number of records shared between each node. This information is
very helpful to understand the volume of data being processed, and
how each node is reducing (or expanding) that number.
4. Percentage of total execution time spent on each node. Shown on
the top right of each node, it indicates the percentage of total
execution time spent on that operator. In this example, 83.2% of the
total execution time was spent on the TableScan operator. This
information is used to populate the "Most Expensive Nodes" list at
the top right of the Query Profile, which simply sorts the nodes by the
percentage of total execution time.
You may notice that the number of rows in and out of the Filter node is the same, implying that the year(o_orderdate)=1997 filter did not accomplish
anything. The filter is eliminating records though, as this table contains 1.5
billion records. This is an unfortunate pitfall of the Query Profile; it does not
show the exact number of records being removed by a particular filter.

As mentioned earlier, you can click on each node to reveal additional execution details and statistics. On the left, you can see the results of
clicking on the TableScan operator. On the right, the results of
the Aggregate operator are shown.
Multi-step Query
When we modify the filter in the query above to contain a subquery, we
end up with a multi-step query.

select

date_trunc('month', o_orderdate) as order_month,

count(*) as num_orders,

sum(o_totalprice) as total_order_value

from snowflake_sample_data.tpch_sf1000.orders

where

o_totalprice > (select avg(o_totalprice) from snowflake_sample_data.tpch_sf1000.orders)

group by order_month

order by order_month
Unlike above, the query plan now contains two steps. First, Snowflake
executes the subquery and calculates the average o_totalprice. The result is
stored, and used in the second step of the query which has the same 5
operators as our query above.

Complex Query
Here is a slightly more complex query with multiple CTEs, one of which is referenced in two other places.²

with

daily_shipments AS (

select

l_shipdate,

sum(l_quantity) AS num_items

from snowflake_sample_data.tpch_sf1000.lineitem

where

l_shipdate >= DATE'1998-01-01'

and l_shipdate <= DATE'1998-08-02'


group by 1

),

daily_summary as (

select

o_orderdate,

count(*) AS num_orders,

any_value(num_items) AS num_items

from snowflake_sample_data.tpch_sf1000.orders

inner join daily_shipments

on orders.o_orderdate=daily_shipments.l_shipdate

group by 1

),

summary_stats as (

select

min(num_items) as min_num_items,

max(num_items) as max_num_items

from daily_shipments
)

select

daily_summary.*,

summary_stats.*
from daily_summary

cross join summary_stats

There are a few things worth noting in this example. First, the daily_shipments CTE is computed just once. Any downstream SQL that
references this CTE calls the WithReference operator to access the results of
the CTE rather than re-calculating them each time.

The scanned/total partitions metric is now a combination of both tables being read in the query. If we click into the TableScan node for
the snowflake_sample_data.tpch_sf1000.orders table, we can see that the table is
being pruned well, with only 154 out of 3242 partitions being scanned. How
is this pruning happening when there was no explicit where filter written in
the SQL? This is the JoinFilter operator in action. Snowflake automatically
applies this neat optimization where it determines the range of dates from
the daily_shipments CTE during the query execution, and then applies those
as a filter to the orders table since the query uses an inner join!

A full mapping of the SQL code to the relevant operator nodes can be found in the notes below.³

What are things to look for in the Snowflake Query Profile?
The most common use case of the Query Profile is to understand why a
particular query is not performing well. Now that we've covered the
foundations, here are some indicators you can look for in the Query Profile
as potential culprits of poor query performance:

1. High spillage to remote disk. As soon as there is any data spillage, it means your warehouse does not have enough memory to process
the data and must temporarily store it elsewhere. Spilling data to
remote disk is extremely slow, and will significantly degrade your
query performance.
2. Large number of partitions scanned. Similarly to spilling data to
remote disk, reading data from remote disk is also very slow. A large
number of partitions scanned means your query has to do a lot of
work reading remote data.
3. Exploding joins. If you see the number of rows coming out of a join
increase, this may indicate that you have incorrectly specified your
join key. Exploding joins generally take longer to process, and lead to
other issues like spilling to disk.
4. Cartesian joins. A cartesian join, also known as a cross join, produces a result set whose size is the number of rows in the first table multiplied by the number of rows in the second table. Cartesian joins can be introduced unintentionally when using a non equi-join like a range join. Due to the volume of data produced, they are both slow and often lead to out-of-memory issues.
5. Downstream operators blocked by a single CTE. As discussed above,
Snowflake computes each CTE once. If an operator relies on that CTE,
it must wait until it is finished processing. In certain cases, it can be
more beneficial to repeat the CTE as a subquery to allow for parallel
processing.
6. Early, unnecessary sorting. It's common for users to add an unneeded
sort early on in their query. Sorts are expensive, and should be
avoided unless absolutely necessary.
7. Repeated computation of the same view. Each time a view is
referenced in a query, it must be computed. If the view contains
expensive joins, aggregations, or filters, it can sometimes be more
efficient to materialize the view first.
8. A very large Query Profile with lots of nodes. Some queries just have
too much going on and can be vastly improved by simplifying them.
Breaking apart a query into multiple, simpler queries is an effective
technique.

In future posts, we'll dive into each of these signals in more detail and share
strategies to resolve them.

Notes
1
Not all 1.3 million records are sent at once. Snowflake has a vectorized
execution engine. Data is processed in a pipelined fashion, with batches of
a few thousand rows in columnar format at a time. This is what allows an
XSMALL warehouse with 16GB of RAM to process datasets much larger
than 16GB. ↩

2
I wouldn't pay too much attention to what this query is calculating or the
way it's written. It was created solely for the purpose of yielding an
interesting example query profile. ↩

3
For readers interested in improving their ability to read Snowflake Query
Profiles, you can use the example query from above to see how each CTE
maps to the different sections of the Query Profile.

Exclude and rename columns when using SELECT * in Snowflake
Date
Sunday, December 18, 2022

Ian Whitestone

Co-founder & CEO of SELECT

Why do we need EXCLUDE & RENAME?


In Snowflake's November 2022 release, they quietly announced some exciting new SQL
syntax, EXCLUDE and RENAME, which allows users to remove and rename specific columns
when running a SELECT * style query. This is particularly exciting, since it is very common
when operating in non-production SQL workflows to run SELECT * style exploratory queries
like:

select *

from table
where id = 5

Users often do this in order to investigate a particular record or subset of records. But what
happens when you want to exclude a specific column (like a large text column), or rename
one of them? Users are forced to type out all of the columns they want, along with any
desired renaming using the traditional AS syntax:

select

column_1,

column_2,

column_3 as column_3_renamed,

column_4,

column_5,

column_6,

column_7,

column_8,

column_9, -- columns 10 and 11 are intentionally left out

column_12

from table

where id = 5

Not the best when you're trying to get answers, quickly.

How to exclude columns when running a SELECT * SQL statement in Snowflake?
Instead of typing out each column, a common ask¹ from users is the ability to specify a
subset of columns to exclude from the table. In many cases, this is much more efficient to
write since the number of columns to be excluded is often much lower than the number to be
included. As of Snowflake's version 6.37 release, you can now include an EXCLUDE clause
in SELECT * style SQL queries where you specify one or more columns that you don't want
returned from the table:

select
    * exclude (column_10, column_11)
from table
where id = 5

This can also be used to exclude a single column:

select
    * exclude (column_10)
from table
where id = 5

Brackets are optional when excluding a single column. The SQL above can be rewritten as:

select
    * exclude column_10
from table
where id = 5

Using Snowflake's EXCLUDE with multiple tables


Another common need for excluding columns is when joining multiple tables. Imagine you have a query like this²:
select *

from orders

join customers

on orders.customer_id=customers.customer_id

join items

on orders.order_id=items.order_id

In this example, we have two order_id columns and two customer_id columns in the
output, since they are present in multiple tables. We can easily exclude these by changing our
SQL to:

select

orders.*,

customers.* exclude customer_id,

items.* exclude order_id

from orders

join customers

on orders.customer_id=customers.customer_id

join items

on orders.order_id=items.order_id

Snowflake's equivalent of BigQuery's & Databricks' EXCEPT


Snowflake's EXCLUDE syntax is similar to BigQuery's and Databricks' EXCEPT syntax:

-- Snowflake

select
*

exclude(column_10, column_11)

from table

-- BigQuery/Databricks

select
    * except (column_10, column_11)
from table

For anyone familiar with DuckDB, Snowflake follows the same EXCLUDE syntax that they
use. Both databases presumably chose to avoid using the EXCEPT keyword to provide this
functionality, since it is already used in set operations.
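For reference, here's the set-operation meaning of EXCEPT that the keyword is already reserved for (the table names here are hypothetical):

-- returns ids present in table_a but not in table_b
select id from table_a
except
select id from table_b;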

Should you use EXCLUDE in production SQL code?


While this new syntax is great for speeding up adhoc workflows, it is not recommended for
production SQL code. It's better to be explicit about the exact columns you are selecting.
Writing out all column names improves readability and auditability. Users reading your SQL
code can immediately understand which columns you are using from which table. It can also
lead to clearer errors. If a column is suddenly removed from a table, your SQL code will
immediately fail. This is much better than a downstream application suddenly erroring due to
a missing column.

Does using EXCLUDE in Snowflake result in a different query plan or performance?
No. Using the EXCLUDE keyword to remove a few columns does not result in a different query
plan or query performance compared to explicitly writing out all columns. You can validate
this behaviour by running both types of queries on a dataset in your Snowflake account and
inspecting the query profile for each query.

How to rename columns when running a SELECT * SQL statement in Snowflake?
Prior to the new RENAME functionality, users were forced to write out every column name as
soon as they needed to rename at least one when running a SELECT * query. Now, users can
easily select all columns while renaming a subset:

select
    * rename (column_3 as column_3_renamed, column_5 as column_5_renamed)
from table
where id = 5

This is much better than having to list out every field just to rename one or two:

-- old method 👎

select

column_1,

column_2,

column_3 as column_3_renamed,

column_4,

column_5 as column_5_renamed,

column_6,

...

column_12

from table

where id = 5

Similar to EXCLUDE, this can also be done for a single column, brackets optional:
select
    * rename (column_3 as column_3_renamed)
from table
where id = 5

Combining EXCLUDE and RENAME


EXCLUDE and RENAME can easily be combined, as shown in the example below:

select
    * exclude (column_10, column_11)
      rename column_3 as column_3_renamed
from table
where id = 5

Much better than the code from the beginning of the blog post:

select

column_1,

column_2,

column_3 as column_3_renamed,

column_4,

column_5,

column_6,
column_7,

column_8,

column_9, -- columns 10 and 11 are intentionally left out

column_12

from table

where id = 5

Notes
1
To give you a sense of demand, this stack overflow question, asked almost 14 years ago,
has 1.3 million views and over 1000 upvotes. ↩

2
Shoutout to Nate Sooter for motivating this example with his recent tweet! ↩

How to speed up range joins in Snowflake by 300x
Date
Monday, January 16, 2023

Ian Whitestone

Co-founder & CEO of SELECT

Range joins and other types of non-equi joins are notoriously slow in most
databases. While Snowflake is blazing fast for most queries, it too suffers
from poor performance when processing these types of joins. In this post
we'll cover an optimization technique practitioners can use to speed up
queries involving a range join by up to 300x.¹
Before diving into the optimization technique, we'll cover some background
on the different types of joins and what makes range joins so slow in
Snowflake. Feel free to skip ahead if you're already familiar.

Equi-joins versus non-equi joins


An equi-join is a join involving an equality condition. Most users will
typically write queries involving one or more equi-join conditions.

select

...

from orders

join customers

on orders.customer_id=customers.id -- example equi-join condition

A non-equi join is a join involving an inequality condition. An example of this could be finding a list of customers who have purchased the same
product:

select distinct

o1.customer_id,

o2.customer_id,

o1.product_id

from orders_items as o1

join orders_items as o2

on o1.product_id=o2.product_id -- equi-join condition

and o1.customer_id<>o2.customer_id -- non-equi join condition

Or finding all orders past a particular date for each customer:


select

...

from orders

inner join customers

on orders.customer_id=customers.id

and orders.created_at > customers.one_year_anniversary_date

What are range joins?


A range join is a specific type of non-equi join. They occur when a join
checks if a value falls within some range of values ("point in interval join"),
or when it looks for two periods that overlap ("interval overlap join").

Point in interval range join


An example of a point in interval range join would be calculating the
number of queries running each second:

select

seconds.timestamp,

count(queries.query_id) as num_queries

from seconds

left join queries

on seconds.timestamp between

date_trunc('second', queries.start_time) and date_trunc('second', queries.end_time)

group by 1
This join could also be based on derived timestamps. For example, find all
purchase events which occurred within 24 hours of users viewing the home
page:

select

...

from page_views

inner join events

on events.event_type='purchase' -- filter condition

and page_views.pathname = '/' -- filter condition

and page_views.user_id=events.user_id -- equi-join condition

and page_views.viewed_at < events.event_at -- range join condition

and dateadd('hour', 24, page_views.viewed_at) >= events.event_at -- range join condition

Interval overlap range join


Interval overlap range joins are when a query tries to match overlapping
periods. Imagine that for each browsing session on your landing page site
you need to find all other sessions that occurred at the same time in your
application:

select

s1.session_id,

array_agg(s2.session_id) as concurrent_sessions

from landing_page_sessions as s1

inner join app_sessions as s2

on s1.end_time > s2.start_time


and s1.start_time < s2.end_time

group by 1

Why are range joins slow in Snowflake?


Range joins are slow in Snowflake because they get executed as cartesian
joins with a post filter condition. A cartesian join, also known as a cross join,
returns the cartesian product of the records between the two datasets
being joined. If both tables have 10 thousand records, then the output of
the cartesian join will be 100 million records! Practitioners often refer to this as a join explosion 💥. Query execution can be slowed down significantly when Snowflake has to process these very large intermediate datasets.

Let's use the "number of queries running per second" example from above
to explore this in more detail.

select

seconds.timestamp,

count(queries.query_id) as num_queries

from seconds

left join queries

on seconds.timestamp between

date_trunc('second', queries.start_time) and date_trunc('second', queries.end_time)

group by 1

Our seconds table contains 1 row per second, and the queries table has 1 row
per query. The goal of this query is to lookup which queries were running
each second, then aggregate and count.
When executing the join, Snowflake first creates an intermediate dataset
that is the cartesian product of the two input datasets being joined. In this
example, the seconds table is 7 rows and the queries table is 4 rows, so the
intermediate dataset explodes to 28 rows. The range join condition that
performs the "point in interval" check happens after this intermediate
dataset is created, as a post-join filter. You can see a visualization of this
process in the image below (go here for a full screen, higher resolution
version).

Running this query on a 30 day sample of data with 267K queries took 12
minutes and 30 seconds. As shown in the query profile, the join is the clear
bottleneck in this query. You can also see the range join condition
expressed as an "Additional Join Condition":
How to optimize range joins in Snowflake
When executing range joins, the bottleneck for Snowflake becomes the
volume of data produced in the intermediate dataset before the range join
condition is applied as a post-join filter. To accelerate these queries, we
need to find a way to minimize the size of the intermediate dataset. This
can be accomplished by adding an equi-join condition, which Snowflake
can process very quickly using a hash join².

Minimize the row explosion


While the principle behind this is intuitive (make the intermediate dataset smaller), it is
tricky in practice. How can we constrain the intermediate dataset prior to
applying the range join post-join filter? Continuing with the queries per
second example from above, it is tempting to add an equi-join condition on
something like the hour of each timestamp:

select
seconds.timestamp,

count(queries.query_id) as num_queries

from seconds

left join queries

on date_trunc('hour', seconds.timestamp)=date_trunc('hour', queries.start_time) -- NEW: equi-join condition

and seconds.timestamp -- range join condition

between date_trunc('second', queries.start_time) and date_trunc('second', queries.end_time)

group by 1

Promising, but the approach falls apart when the interval (query total run
time) is greater than 1 hour. Because the equi-join is on the hour the query
started in, all records in any subsequent hours wouldn't be counted.

This can be solved by creating an intermediate dataset, query_hours,


containing 1 row per query per hour the query ran in. It then becomes safe
to join on hour, since we'll have 1 row for each hour the query ran. No
records get inadvertently dropped.

with

query_hours as (

select

queries.*,

hours_list.timestamp as query_hour

from queries

inner join hours_list -- dataset containing 1 row per hour

on hours_list.timestamp between date_trunc('hour', queries.start_time) and date_trunc('hour',


queries.end_time)
)

select

seconds.timestamp,

count(queries.query_id) as num_queries

from seconds

left join query_hours as queries

on date_trunc('hour', seconds.timestamp)=queries.query_hour -- NEW: equi-join condition

and seconds.timestamp -- range join condition

between date_trunc('second', queries.start_time) and date_trunc('second', queries.end_time)

group by 1

You might have noticed that the query_hours CTE involves a range join itself -
won't that be slow? When applied to the right queries, the additional time
spent on input dataset preparation results in a much faster query overall³.
Another concern may be that the query_hours dataset becomes much larger
than the original queries dataset, as it fans out to 1 row per query per hour.
In practice, since most queries finish in well under 1 hour, the query_hours
dataset will be similar in size to the original queries dataset.

Adding the new equi-join condition on hours helps accelerate this range join
query by constraining the size of the intermediate dataset. However, this
approach is not ideal for a few reasons. Maybe hour isn't the best choice,
and something else should be used as a constraint. Additionally, how can
this approach be extended to support range joins involving other numeric
datatypes, like integers and floats?

Binned range join optimization


We can extend the ideas from above into a more generic approach via the
use of 'bins'⁴.
By telling Snowflake to only apply the range join condition on smaller
subsets of data, the join operation is much faster. For each timestamp,
Snowflake now only joins queries that ran in the same hour, instead of
every query from all time.

Instead of limiting ourselves to predefined ranges like "hour", "minute" or
"day", we can use arbitrarily sized bins. For example, if most queries
run in under 2 seconds, we could bucket the queries into bins that span 2
seconds each.

The algorithm would look something like this:

1. Generate the bins and add bin numbers to each dataset


2. Add the equi-join condition constraint to the range join using bin_num ,
similar to what was done above with hour.
3. The intermediate dataset created is now much smaller.
4. As usual, Snowflake applies the range join as a post-join filter. This
time, much faster.

You can see a visualization of this process in the image below (go here for
a full screen, higher resolution version).
Example binned range join query
Bin numbers are just integers that represent a range of data. One way to
create them is to divide the number by the desired bin size. With
timestamps, we can first convert the timestamp to unix time, which is an
integer, before dividing:

-- for 60 second sized bins

select

timestamp,

floor(date_part(epoch_second, timestamp) / 60) as bin_num

We'll save this in a function, get_bin_number⁵, to avoid repeating it each time.
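
The same idea extends to the other numeric datatypes raised earlier; for an integer or float column you simply divide the value by the desired bin width. A minimal sketch (the orders table, order_total column, and bin width of 10 are hypothetical):

-- hypothetical example: bins that are 10 units wide over a numeric column
select
    order_total,
    floor(order_total / 10) as bin_num
from orders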

Following the steps described above, we first need to generate the list of
applicable bins. This is accomplished by using a generator to create a list of
integers, then filtering that list down to the desired start and end bin
numbers⁶.

set bin_size_s = 60;

with

metadata as (

select

-- this would be a query against your desired time range

min(timestamp) as start_time,

max(timestamp) as end_time,

get_bin_number(start_time, $bin_size_s) as bin_num_start,

get_bin_number(end_time, $bin_size_s) as bin_num_end


from seconds

),

-- need a CTE with 1 row between bin_num_start and bin_num_end

-- have to first generate a massive list, then filter down since you can't pass in calculated values

-- when bins_base has 1 trillion rows it takes ~5 seconds to filter down; ~106 ms for 1 million rows

bins_base as (

select

seq4() as row_num

from table(generator(rowcount => 1e9))

),

bins as (

select

bins_base.row_num as bin_num

from bins_base

inner join metadata

on bins_base.row_num between metadata.bin_num_start and metadata.bin_num_end

),

Now we can add the bin number to each dataset. For the queries dataset,
we'll output a dataset with 1 row per query per bin that the query ran
within. For the seconds dataset, each timestamp will be mapped to a single bin.

queries_w_bin_number as (

select
start_time,

end_time,

warehouse_id,

cluster_number,

bins.bin_num

from queries

inner join bins

on bins.bin_num between

get_bin_number(queries.start_time, $bin_size_s) and get_bin_number(queries.end_time, $bin_size_s)

),

seconds_w_bin_number as (

select

timestamp,

get_bin_number(timestamp, $bin_size_s) as bin_num

from seconds

)
And apply the final join condition, with the added equi-join condition
on bin_num:

select

s.timestamp,

count(q.warehouse_id) as num_queries
from seconds_w_bin_number as s

left join queries_w_bin_number as q

on s.bin_num=q.bin_num

and s.timestamp between date_trunc('second', q.start_time) and date_trunc('second', q.end_time)

group by 1

Using the same dataset as above, this query executed in 2.2 seconds⁷,
whereas the un-optimized version from earlier took 750 seconds. That's
over a 300x improvement. The query profile is shown below. Note how the
join condition now shows two sections: one for the equi-join condition
on bin_num, and another for the range join condition.

Choosing the right bin size


A key part of making this strategy work involves picking the right bin size.
You want each bin to contain a small range of values, in order to minimize
row explosion in the intermediate dataset before the range join filter is
applied. However, if the bin size you pick is too small, the size of your "right
table" (queries) will increase significantly when you fan it out to 1 row per
bin.

According to Databricks, a good rule of thumb is to pick the 90th
percentile of your interval length. You can calculate this using
the approx_percentile function. I've shown the values below for the queries sample
dataset I've been using throughout this post.

select

approx_percentile(datediff('second', start_time, end_time), 0.5) as p50, -- 2s

approx_percentile(datediff('second', start_time, end_time), 0.90) as p90, -- 30s

approx_percentile(datediff('second', start_time, end_time), 0.95) as p95, -- 120s

approx_percentile(datediff('second', start_time, end_time), 0.99) as p99, -- 600s

approx_percentile(datediff('second', start_time, end_time), 0.999) as p999, -- 900s

count(*) -- 267K

from queries

Rules of thumb are not perfect. If possible, test your query with a few
different bin sizes and see what performs best. Here's the performance
curve for the query above, using different bin sizes. In this case, picking the
99.9th percentile versus the 90th percentile didn't make much of a
difference. As expected, query times started to get worse once the bin size
got really small.
How to extend to a join with a fixed interval?

If you have a point in interval range join with a fixed interval size, like the
query shared earlier:

select

...

from page_views

inner join events

on events.event_type='purchase' -- filter condition

and page_views.pathname = '/' -- filter condition

and page_views.user_id=events.user_id -- equi-join condition


and page_views.viewed_at < events.event_at -- range join condition

and dateadd('hour', 24, page_views.viewed_at) >= events.event_at -- range join condition

Then set your bin size to the size of the interval: 24 hours.
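
As a minimal sketch, reusing the hypothetical get_bin_number helper and a bins CTE generated the same way as in the queries-per-second example (page_views is the side that gets fanned out, since its 24 hour window is the interval):

set bin_size_s = 86400; -- 24 hours, matching the fixed interval

with page_views_w_bin_number as (
    select
        page_views.*,
        bins.bin_num
    from page_views
    inner join bins
        -- 1 row per page view per bin its 24 hour window overlaps (at most 2 bins)
        on bins.bin_num between
            get_bin_number(page_views.viewed_at, $bin_size_s)
            and get_bin_number(dateadd('hour', 24, page_views.viewed_at), $bin_size_s)
),

events_w_bin_number as (
    select
        events.*,
        get_bin_number(events.event_at, $bin_size_s) as bin_num
    from events
)

select
    ...
from page_views_w_bin_number as page_views
inner join events_w_bin_number as events
    on page_views.bin_num = events.bin_num -- NEW: equi-join condition on bin
    and events.event_type = 'purchase' -- filter condition
    and page_views.pathname = '/' -- filter condition
    and page_views.user_id = events.user_id -- equi-join condition
    and page_views.viewed_at < events.event_at -- range join condition
    and dateadd('hour', 24, page_views.viewed_at) >= events.event_at -- range join condition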

How to extend to an interval overlap range join?


If you are dealing with an interval overlap range join, like the one shown
below:

select

s1.session_id,

array_agg(s2.session_id) as concurrent_sessions

from landing_page_sessions as s1

inner join app_sessions as s2

on s1.end_time > s2.start_time

and s1.start_time < s2.end_time

group by 1

You can apply the same binned range join technique after you've fanned
out both landing_page_sessions and app_sessions to contain 1 row per session
per bin the session fell within (as was done with queries above).
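
A minimal sketch of that, again assuming the hypothetical get_bin_number helper and a bins CTE built as in the example above:

with landing_page_sessions_w_bin as (
    select
        s1.*,
        bins.bin_num
    from landing_page_sessions as s1
    inner join bins
        -- 1 row per landing page session per bin it overlaps
        on bins.bin_num between
            get_bin_number(s1.start_time, $bin_size_s) and get_bin_number(s1.end_time, $bin_size_s)
),

app_sessions_w_bin as (
    select
        s2.*,
        bins.bin_num
    from app_sessions as s2
    inner join bins
        on bins.bin_num between
            get_bin_number(s2.start_time, $bin_size_s) and get_bin_number(s2.end_time, $bin_size_s)
)

select
    s1.session_id,
    -- distinct, because a pair of sessions overlapping across several bins
    -- would otherwise be counted once per shared bin
    array_agg(distinct s2.session_id) as concurrent_sessions
from landing_page_sessions_w_bin as s1
inner join app_sessions_w_bin as s2
    on s1.bin_num = s2.bin_num -- NEW: equi-join condition on bin
    and s1.end_time > s2.start_time -- interval overlap conditions
    and s1.start_time < s2.end_time
group by 1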

When should this optimization be used?


As a first step, ensure that the range join is actually a bottleneck by using
the Snowflake query profile to validate it is one of the most expensive
nodes in the query execution. Adding the binned range join optimization
does make queries harder to understand and maintain.

The binned range join optimization technique only works for point in
interval and interval overlap range joins involving numeric types. It will not
work for other types of non-equi joins, although you can apply the same
principle of trying to add an equi-join constraint wherever possible to
reduce the row explosion.

If the dataset on the "right", with the start and end times, contains a
relatively flat distribution of interval sizes, then this technique won't be as
effective.

Notes
1
This stat is from a single query, so take it with a handful of salt. Your
mileage will vary depending on many factors. ↩

2
This approach was inspired by Simeon Pilgrim's post from 2016 (back
when Snowflake was snowflake.net!). I used it quite successfully until
implementing the more generic binning approach. ↩

3
The range join to the hours table will be much quicker than the range join
to the seconds table, since the intermediate table will be ~3600 times
smaller. ↩

4
This approach was inspired by Databricks. They don't go into the details
of how their algorithm is implemented, but I assume it works in a similar
fashion. ↩

5
Optionally create a get_bin_number function to avoid copying the same
calculation throughout the query:

create or replace function get_bin_number(timestamp timestamp_tz, bin_size_s integer)

returns integer

as

$$

floor(date_part(epoch_second, timestamp) / bin_size_s)

$$

6
Snowflake doesn't let you pass in calculated values to the generator, so
this had to be done in a two step process. In the near future we'll be open
sourcing some dbt macros to abstract this process away. ↩

7
The full example binned range join optimization query:

create or replace function get_bin_number(timestamp timestamp_tz, bin_size_s integer)

returns integer

as

$$

floor(date_part(epoch_second, timestamp) / bin_size_s)

$$

set bin_size_s = 60;

with

metadata as (

select

-- Get the time range your query will span

min(timestamp) as start_time,

max(timestamp) as end_time,

get_bin_number(start_time, $bin_size_s) as bin_num_start,


get_bin_number(end_time, $bin_size_s) as bin_num_end

from seconds

),

bins_base as (

select

seq4() as row_num

from table(generator(rowcount => 1e9))

),

bins as (

select

bins_base.row_num as bin_num

from bins_base

inner join metadata

on bins_base.row_num between metadata.bin_num_start and metadata.bin_num_end

),

queries_w_bin_number as (

select

start_time,

end_time,

warehouse_id,

cluster_number,
bins.bin_num

from queries

inner join bins

on bins.bin_num between

get_bin_number(queries.start_time, $bin_size_s)

and get_bin_number(queries.end_time, $bin_size_s)

),

seconds_w_bin_number as (

select

timestamp,

get_bin_number(timestamp, $bin_size_s) as bin_num

from seconds

)
select

s.timestamp,

count(q.warehouse_id) as num_queries

from seconds_w_bin_number as s

left join queries_w_bin_number as q

on s.bin_num=q.bin_num

and s.timestamp between date_trunc('second', q.start_time) and date_trunc('second', q.end_time)

group by 1
;


3 ways to configure Snowflake warehouse
sizes in dbt
Date
Wednesday, January 18, 2023

Niall Woodward

Co-founder & CTO of SELECT

The ability to use different warehouse sizes for different workloads in Snowflake provides
enormous value for performance and cost optimization. dbt natively integrates with
Snowflake to allow specific warehouses to be chosen down to the model level. In this post,
we explain exactly how to use this feature and share some best practices.

Why change warehouse size?


If your dbt project is starting to take a long time to run, causing SLA misses, or just a bad
user experience, then increasing the warehouse size will likely improve its speed. Or, if
you’ve increased dbt’s default warehouse size already, you may be looking to reduce
costs by using the larger size only for the models that benefit from it.

Speed up dbt models


Models often take longer as the volume of data increases over time. For some models, this
slowdown will be linear with the volume of data. For other models, that’s not always the
case, as aggregate functions and joins can become disproportionately compute-intensive as
the volume of data grows. All of these effects can impact execution time significantly,
especially if a model starts spilling to remote storage (check out our post on the query
profile to learn more).

The single most effective way of speeding up any query is to reduce the amount of data it
processes. If a model is becoming slow and uses a table materialization, consider the
possibility of using an incremental materialization to process only new or updated data each
time it runs.
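
As a rough sketch of what that can look like in a dbt model (the unique_key and the updated_at column are placeholders for whatever suits your data):

{{ config(
    materialized='incremental',
    unique_key='id'
) }}

select ...
from {{ ref('stg_orders') }}

{% if is_incremental() %}
-- only process rows added or updated since the last run of this model
where updated_at > (select max(updated_at) from {{ this }})
{% endif %}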

If you’re already using incrementalization, or it’s not possible, then increasing the warehouse
size is likely the next best step for speeding up the model.

Configuring dbt’s default Snowflake warehouse


By default, dbt uses the Snowflake warehouse configured in the profiles.yml entry for the
project, or if not set, the default warehouse for dbt’s Snowflake user. By swapping this
warehouse for a larger one, or simply resizing the existing warehouse, dbt will
now execute all its queries on a larger warehouse. Depending on where dbt is running, the
warehouse can be changed in the profiles.yml file, or in dbt Cloud at the environment
level.

profiles.yml
In your profiles.yml file, edit the warehouse config to a different virtual warehouse. The
size of each warehouse is configured in Snowflake.

select_internal:
  outputs:
    dev:
      type: snowflake
      account: org.account
      user: niall
      password: XXXXX
      warehouse: dev
      database: dev
      schema: niall
      threads: 8
  target: dev

dbt Cloud
In dbt Cloud, navigate to Deploy > Environments in the top menu bar. Choose the
environment you want to edit, then Settings. Click Edit, then scroll down to Deployment
Connection where the warehouse can be changed. The size of the warehouse is configured in
Snowflake.

Changing the default dbt warehouse size isn’t necessarily wise, however, as most queries in
the project will not benefit from the increased warehouse size, leading to increased
Snowflake costs. For more details on the impact of warehouse size on query speed, see
our warehouse sizing post. Instead of increasing the default warehouse size, we recommend
setting the default warehouse size to X-Small, and overriding the warehouse size at the
individual model level as needed.

Configuring a dbt model's Snowflake warehouse size


Configuring the warehouse used at the model level means that we can choose specific
warehouse sizes for each model’s requirements, optimizing performance and cost.

Hardcoded warehouse
dbt provides the snowflake_warehouse model configuration, which looks like this when set
in a specific model:
{{ config(

snowflake_warehouse="dbt_large"

) }}

select

...

from {{ ref('stg_orders') }}

Alternatively, the configuration can be applied to all models in a directory


using dbt_project.yml , such as:

name: my_project
version: 1.0.0

models:
  +snowflake_warehouse: 'dbt_xsmall'
  my_project:
    clickstream:
      +snowflake_warehouse: 'dbt_large'

Dynamic warehouse name based on environment


In many cases, the warehouse we want to use in production differs from the one that we
might want to use when developing, or in a CI or automated testing workflow. The same goes
for the warehouses we’re now configuring for individual models. To do that, we can use a
macro in place of the literal warehouse value we have before:
{{ config(

snowflake_warehouse=get_warehouse('large')

) }}

select

...

from {{ ref('stg_orders') }}

This macro can implement logic to return the desired warehouse size for the environment.

Suppose that in production, we have created warehouses


named dbt_production_<size> (dbt_production_xsmall , dbt_production_small , dbt_pr
oduction_medium etc.), and for CI, we have warehouses named dbt_ci_<size> . For local
development, we just want to use the default warehouse, and ignore the configured
warehouse size altogether. We also want to raise an error if the chosen warehouse size isn’t
available in a managed list. We can do that with the following macro logic:

{% macro get_warehouse(size) %}

    {% set available_sizes = ['xsmall', 'small', 'medium', 'large', 'xlarge', '2xlarge'] %}

    {% if size not in available_sizes %}
        {{ exceptions.raise_compiler_error("Warehouse size not one of " ~ available_sizes) }}
    {% endif %}

    {% if target.name in ('production', 'prod') %}
        {% do return('dbt_production_' ~ size) %}
    {% elif target.name in ('ci') %}
        {% do return('dbt_ci_' ~ size) %}
    {% else %}
        {% do return(None) %}
    {% endif %}

{% endmacro %}

Using a macro for the snowflake_warehouse config only works in model files, and cannot be
used in the dbt_project.yml .

Configuring the warehouse for other resources in dbt


It’s currently only possible to configure warehouses for models and snapshots. If you’re
interested in keeping up to date on support for other resources such as tests, take a look at
this GitHub issue.

Monitoring dbt model performance and cost


Check out our dbt_snowflake_monitoring dbt package which provides an easy-to-
use dbt_queries model for understanding the performance of dbt models over time. It
attributes costs to individual model runs, which makes it easy to answer questions like “what
are the 10 costliest models in the last month?”.

Conclusion
Thanks for reading! In an upcoming post we’ll share more recommendations for optimizing
dbt performance on Snowflake. Make sure to subscribe for notifications on future posts, and
feel free to reach out if you have any questions!

Snowflake query tags for


enhanced monitoring
Date
Wednesday, February 8, 2023

Ian Whitestone
Co-founder & CEO of SELECT

Snowflake query tags allow users to associate arbitrary metadata with each
query. In this post, we show how you can use query tags to achieve better
visibility & monitoring for your Snowflake query costs and performance.

What is a query tag in Snowflake?


Query tags are an optional session level parameter that allow users to tag
any Snowflake SQL statement with a string. They can be up to 2000
characters in length, and contain any characters. Query tag values for each
query can be found in the output of the query_history views in Snowflake,
allowing them to be leveraged for a variety of use cases.

What about object tagging?


It's worth noting that query tags are different than object tags. Both
accomplish the shared purpose of enabling more structured monitoring
and improving visibility within your Snowflake account. However, object
tags are used for persistent account objects like users, roles, tables, views,
functions and more.

Why should you use query tags in Snowflake?


For most Snowflake users, compute costs from queries running in virtual
warehouses will make up the vast majority of their Snowflake spend. While
it is possible to attribute compute spend within a warehouse to different
users by calculating cost per query, this is often not granular enough
since a single production user account can generate the vast majority of
queries, and costs.

Query tags enable more fine grained cost attribution. If you have a single
SQL statement, or series of SQL statements associated with a data model in
a pipeline, you can assign them the same query tag. Costs can then be
easily attributed to all queries associated with the given tag. The alternative
to this involves grouping by query_text, which does not allow for multiple
related SQL statements to be bucketed together. It also falls apart when the
SQL text for a given data model inevitably gets changed.
Query tags can also be used for more granular query
performance monitoring. Sticking with the example from earlier, users may
wish to monitor the total runtime for each data model by grouping
together the total elapsed time for all associated queries. Alternatively, if a
set of queries are used to power user facing application dashboards,
leveraging query tags can allow for more targeted performance monitoring.

Last of all, query tags provide the ability to link queries with metadata from
other systems. A query tag could contain a dashboard_id, which could enable
users to aggregate all costs for a single dashboard, and then see how often
that dashboard is used through the BI tool's metadata.

How do you use query tags in Snowflake?


Setting default tags
Query tags are a session level parameter, but defaults can be set at the
account and user level. For example:

alter user shauna set query_tag = '{"team": "engineering", "user": "shauna"}';

Every query issued by this user will now have this default tag.
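
A default can also be applied account-wide in the same way (this requires the accountadmin role; the tag value here is just an illustration):

alter account set query_tag = '{"team": "data-platform"}';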

Setting query tags at the session level


Use the alter session command to set the query tag. After this command is
run, all subsequent queries run in the same session will be tagged with
that string.

alter session set query_tag='users_model';

-- this query will be tagged with 'users_model'

create or replace table users_tmp as (

select *
from raw_users

where

not deleted

and created_at > current_date - 1

);

-- this will also be tagged with 'users_model'

insert into users

select * from users_tmp;

alter session set query_tag='orders_model';

-- this will get tagged with 'orders_model'

create or replace table orders as (

select *

from raw_orders

where

not deleted

);

We recommend using user defaults wherever possible, to avoid


frequent alter session calls which add latency to the overall query execution.
Setting query tags in Python
If you are using Python to execute queries, there are two ways to set query
tags.

Setting once upon connection creation


When you are creating your connection object with the Snowflake Python
Connector, you can set any session parameters upfront. In the example
below, all queries executed from this con object will be tagged
with DATA_MODELLING_PIPELINE .

con = snowflake.connector.connect(
    user='XXXX',
    password='XXXX',
    account='XXXX',
    session_parameters={
        'QUERY_TAG': 'DATA_MODELLING_PIPELINE',
    },
)

Modifying manually through alter session


If you don't want all queries in your session to have the same tag, you can
instead run alter session set query_tag = 'XXX' prior to running your actual
queries.

con.cursor().execute("alter session set query_tag='users_model'")

query = """

create or replace table users_tmp as (


select *

from raw_users

where

not deleted

and created_at > current_date - 1

"""

con.cursor().execute(query) # tagged with 'users_model'

con.cursor().execute("insert into users from users_tmp") # tagged with 'users_model'

con.cursor().execute("alter session set query_tag='orders_model'");

query = """

create or replace table orders as (

select *

from raw_orders

where

not deleted

"""
con.cursor().execute(query) # tagged with 'orders_model'

Setting query tags in dbt


If you are using dbt, there are three options for setting query tags:

1. They can be set once in your profiles.yml (source). All queries run in
your dbt project will then be tagged with that value.
2. Tags can be set for all models under a particular resource_path, or for a
single model, by adding a +query_tag in your dbt_project.yml. For
individual models, you can also specify the query tag in the model
config, i.e. {{ config(query_tag = 'XXX') }} . If a default query tag has
been set in profiles.yml, it will be overridden by any of these more
precise tags.
3. You can create a set_query_tag macro which automatically sets the
query tag to the model name for all models in your project.

Refer to the dbt documentation for examples of each of these options,


and do make note of the potential failure mode they listed where queries
can be set with an incorrect tag if specific failures occur upstream.

We have recently released a new dbt package dbt-snowflake-query-


tags to tag all dbt-issued queries with a comprehensive set of metadata,
check it out.

Using JSON strings


When setting query tags, we recommend using a JSON object for ease-of-
use and consistency. Continuing with the data model tagging example from
earlier, we could use a JSON object to add additional information like the
environment the model ran in, the version, the trigger for the data model to
run (was it a scheduled run or manually invoked?), and more.

import json

query_tag = {
    'app_name': 'pipeline',
    'model_name': 'users',
    'environment': 'prod',
    'version': 'v1.2',
    'trigger': 'schedule'
}

con.cursor().execute(f"alter session set query_tag='{json.dumps(query_tag)}'")

con.cursor().execute(model_sql)

How to use query tags for Snowflake cost &


performance monitoring
Query tags are surfaced in the query history views for each query_id. Here's
an example query which shows average query performance by query_tag:

select

query_tag,

count(*) as num_executions,

avg(total_elapsed_time/1000) as avg_total_elapsed_time_s

from snowflake.account_usage.query_history

where

start_time > current_date - 7

group by 1
If the query_tag contains a JSON object, it can be parsed and segmented by
any of the keys. Using the example from above:

select

try_parse_json(query_tag)['model_name']::string as model_name,

count(*) as num_executions,

avg(total_elapsed_time/1000) as avg_total_elapsed_time_s

from snowflake.account_usage.query_history

where

try_parse_json(query_tag)['app_name']::string = 'pipeline'

and start_time > current_date - 7

group by 1

Using the dbt-snowflake-monitoring package


If you are using SELECT's dbt package for cost & performance
monitoring, you can analyze cost by query tag in addition to the
performance:

select

try_parse_json(query_tag)['model_name']::string as model_name,

count(*) as num_executions,

sum(query_cost) as total_cost,

avg(total_elapsed_time_s) as avg_total_elapsed_time_s

from query_history_enriched

where
try_parse_json(query_tag)['app_name']::string = 'pipeline'

and start_time > current_date - 7

group by 1

You'll also find that these queries run much quicker than a query against
Snowflake's account usage views, since the table is materialized and sorted
by start_time to achieve a well clustered state.

Using query comments instead of query tags


Another common practice for tagging queries is to add a comment to the
bottom of each query¹:

create or replace table orders as (

select *

from raw_orders

where

not deleted

);

-- '{"model_name": "orders", "environment": "prod", "version": "v1.2", "app_name": "pipeline", "trigger":


"schedule"}'

This has the advantage of being universally applicable to all data
warehouses, and is easier to implement since it doesn't require running
an alter session statement. Another desirable advantage is performance,
since running an alter session statement involves a round trip network call
to Snowflake². This is fine for most use cases, but may not be acceptable in
applications where an additional 100-200ms in response time matters.

Lastly, since the query text can be up to 1MB in size, query comments can
contain much more metadata than query tags, which are limited to 2000
characters.

Where possible, we recommend using query tags since they are much
simpler to parse and analyze downstream. If it's possible for your query
metadata to exceed 2000 characters, stick with query comments.

1
Snowflake automatically removes any comments at the beginning of each
query, so you must append them to the end of the query.

2
The alter session statement itself is extremely fast, taking about 30ms on
average.
Monitoring dbt model spend and
performance with metadata
Date
Friday, February 24, 2023

Ian Whitestone

Co-founder & CEO of SELECT

In a previous post, we covered how Snowflake query


tags & comments allow users to associate arbitrary metadata with each
query. In this post, we show how you can add query tags or comments to
your dbt models in order to track their spend or performance over time.

Why track dbt model spend & performance?


dbt has skyrocketed in popularity over the last 5 years, becoming the most
popular framework for building and managing data models within the data
warehouse. Performing lots of data transformations on large datasets
within the warehouse is not cheap, however. Whether using dbt, or any
other SQL-based transformation tool, the costs associated with these
transformations tend to make up a significant portion of Snowflake
customers' compute spend.

As customers look to better understand, monitor, and reduce their data


cloud spend, it has become increasingly important to get more insight into
the spend associated with each dbt model. Additionally, as customers
increasingly use dbt to power business critical applications and decision
making, it becomes necessary for customers to monitor model performance
in order to ensure that SLAs are met.

When using Snowflake and dbt, customers do not get these crucial
monitoring features out of the box. By adding metadata to their dbt models
through query tags or comments, customers can achieve these core
monitoring abilities.

Setting query tags in dbt


In our post on query tags, we outlined the three options for setting query
tags in dbt:

1. Setting it globally in your profiles.yml


2. Adding a query_tag for each model in your dbt_project.yml or in the
model config
3. Creating a set_query_tag macro to dynamically set the query tag for
each model in your project.

Approach #3 is by far the best option as it avoids having users manually set
the tags. If you'd like to get started with dynamically setting query tags for
each model, you can implement a custom macro like the one here to add
detailed metadata to each query issued by dbt.

Setting query comments in dbt (recommended


approach)
For dbt-related metadata, we recommend using query comments instead
of query tags, since the machine-generated metadata can occasionally
exceed the 2000 character limit of query tags.
dbt provides this setting out of the box. In your dbt_project.yml , you can
add the following:

query-comment:
  append: true # Snowflake removes prefixed comments

This will add a query comment to the bottom of your query:

create view analytics.analytics.orders as (

select ...

);

/* {"app": "dbt", "dbt_version": "0.15.0rc2", "profile_name": "debug",

"target_name": "dev", "node_id": "model.dbt2.my_model"} */

In order to add more comprehensive metadata in the query comment, you


can install our dbt-snowflake-monitoring query package. This package
makes the following metadata available for all dbt queries:

"dbt_snowflake_query_tags_version": "2.0.0",

"app": "dbt",

"dbt_version": "1.4.0",

"project_name": "my_project",

"target_name": "dev",

"target_database": "dev",

"target_schema": "larry_goldings",
"invocation_id": "c784c7d0-5c3f-4765-805c-0a377fefcaa0",

"node_name": "orders",

"node_alias": "orders",

"node_package_name": "my_project",

"node_original_file_path": "models/staging/orders.sql",

"node_database": "dev",

"node_schema": "mart",

"node_id": "model.my_project.orders",

"node_resource_type": "model",

"materialized": "incremental",

"is_incremental": true,

"node_refs": [

"raw_orders",

"product_mapping"

],

"dbt_cloud_project_id": "146126",

"dbt_cloud_job_id": "184124",

"dbt_cloud_run_id": "107122910",

"dbt_cloud_run_reason_category": "other",

"dbt_cloud_run_reason": "Kicked off from UI by niall@select.dev",

}
Using this info, you can monitor cost and performance by a variety of
interesting dimensions, such as dbt project, model name, environment (dev
or prod), materialization type, and more.

Monitoring dbt model performance


When using query tags, you can monitor your dbt model performance
using a variation of the query below:

select

date_trunc('day', start_time) as date,

try_parse_json(query_tag)['model_name']::string as model_name,

count(distinct try_parse_json(query_tag)['invocation_id']) as num_executions,

avg(total_elapsed_time/1000) as avg_total_elapsed_time_s,

approx_percentile(total_elapsed_time/1000, 0.95) as p95_total_elapsed_time_s,

avg(execution_time/1000) as avg_execution_time_s,

approx_percentile(execution_time/1000, 0.95) as p95_execution_time_s

-- optionally repeat for other query time metrics, like:

-- compilation_time, queued_provisioning_time, queued_overload_time, etc.

from snowflake.account_usage.query_history

where
try_parse_json(query_tag)['app']::string = 'dbt'

and start_time > current_date - 30

group by 1,2

If you're using query comments, you'll first have to parse out the metadata
from the comment text:

with

query_history as (

select

*,

regexp_substr(query_text, '/\\*\\s({"app":\\s"dbt".*})\\s\\*/', 1, 1, 'ie') as _dbt_json_meta,

try_parse_json(_dbt_json_meta) as dbt_metadata

from snowflake.account_usage.query_history

)
select

date_trunc('day', start_time) as date,

dbt_metadata['model_name']::string as model_name,

count(distinct dbt_metadata['invocation_id']) as num_executions,

avg(total_elapsed_time/1000) as avg_total_elapsed_time_s,

approx_percentile(total_elapsed_time/1000, 0.95) as p95_total_elapsed_time_s,


avg(execution_time/1000) as avg_execution_time_s,

approx_percentile(execution_time/1000, 0.95) as p95_execution_time_s

-- optionally repeat for other query time metrics, like:

-- compilation_time, queued_provisioning_time, queued_overload_time, etc.

...

from query_history

where

dbt_metadata is not null

and start_time > current_date - 30

group by 1,2

dbt-snowflake-monitoring to the rescue!


If you're using our dbt-snowflake-monitoring package, this query
comment parsing is automatically done for you.

Both queries above only look at the query run times. A number of other
metrics worth monitoring in conjunction are listed below, with a sketch query after the list:

 partitions_scanned & partitions_total : informs you whether the queries


are efficiently pruning out micro-partitions
 bytes_scanned : gives you an idea of how much data is being processed
over time, which may explain increased run times
 bytes_spilled_to_local_storage & bytes_spilled_to_remote_storage : indicates
whether your model may benefit from running on a larger warehouse
 queued_overload_time : indicates whether you may need to increase
your max_cluster_count on the warehouse
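
All of these columns live in the same account usage query_history view, so, as a sketch, they can be pulled alongside the timing metrics (query-tag variant shown; swap in the comment parsing if you use query comments):

select
    try_parse_json(query_tag)['model_name']::string as model_name,
    avg(partitions_scanned / nullif(partitions_total, 0)) as avg_partition_scan_ratio,
    sum(bytes_scanned) as total_bytes_scanned,
    sum(bytes_spilled_to_local_storage) as total_bytes_spilled_local,
    sum(bytes_spilled_to_remote_storage) as total_bytes_spilled_remote,
    avg(queued_overload_time / 1000) as avg_queued_overload_time_s
from snowflake.account_usage.query_history
where
    try_parse_json(query_tag)['app']::string = 'dbt'
    and start_time > current_date - 30
group by 1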

Monitoring dbt model spend


In order to monitor dbt model costs over time, first calculate the cost of
each Snowflake query. Next, the query costs can be used to aggregate
spend by model:

select

date_trunc('day', start_time) as date,

try_parse_json(query_tag)['model_name']::string as model_name,

count(distinct try_parse_json(query_tag)['invocation_id']) as num_executions,

sum(query_cost) as total_cost,

total_cost / num_executions as avg_cost_per_execution

from query_history_w_costs

where

try_parse_json(query_tag)['app']::string = 'dbt'

and start_time > current_date - 30

group by 1,2

If you'd like to automatically get a query history with the costs for each
query added, you can install our dbt-snowflake-monitoring package.

Use SELECT for dbt model cost & performance


monitoring
At SELECT, we leverage query tags & comments to allow our customers to
monitor their dbt-issued queries by a variety of dimensions: environment,
materialization, resource type, etc. Gaining this visibility is incredibly
valuable, as it allows data teams to easily prioritize which dbt models
require some extra attention. An example of how we surface this
information in the product is shown below:

If you're looking to cut your Snowflake costs, want to get a better picture of
what's driving them, or just want to keep a pulse on things, you can get
access today or book a demo using the links below.
Should you use CTEs in Snowflake?
Date
Wednesday, March 15, 2023

Niall Woodward

Co-founder & CTO of SELECT

CTEs are an extremely valuable tool for modularizing and reusing SQL logic.
They're also a frequent focus of optimization discussions, as their usage has
been associated with unexpected and sometimes inefficient query
execution. In this post, we dig into the impact of CTEs on query plans,
understand when they are safe to use, and when they may be best avoided.

Introduction
Much has been written about the impact of CTEs on performance in the
past few years:

 CTEs are pass-throughs - Tristan Handy - 2018-11-07

 Snowflake query optimiser: unoptimised - Dominik Golebiewski -


2021-10-13

 CTE Considerations - bennieregenold7 - 2022-07-22

 A recent dbt Slack thread - 2023-02-22

But the fact we’re still seeing so much discussion shows we’ve not reached
a conclusion yet. This post aims to provide a reasoned set of guidelines for
when you should use CTEs, and when you might want to avoid them.
Snowflake's query optimizer is being continuously improved, and like in the
posts linked above, the behavior observed in this post will change over
time.

We'll make use of query profiles to understand the impact of different


query designs on execution. If query profiles are new to you or you'd like a
refresher, check out our post on how to use Snowflake's query profile.

Let’s start with a recap of what CTEs are and why they’re popular.

What are CTEs?


A CTE, or common table expression, is a subquery with a name attached.
They are declared using a with clause, and can then be selected from using
their name identifier:

with my_cte as (

select 1

)
select * from my_cte

CTEs are comma delimited, meaning we can define several by separating


them with commas:

with my_cte as (

select 1

),

my_cte_2 as (

select 2

)
select *

from my_cte

left join my_cte_2

We can also put CTEs inside CTEs if we so wish (though things get a little
hard to read!):

with my_cte as (

with my_inner_cte as (

select 1

)

select * from my_inner_cte

)

select *

from my_cte

Why use CTEs?


The main reasons to use CTEs are:

1. CTEs can help separate SQL logic into separate, isolated subqueries.
That makes debugging easier as you can simply select * from
cte to run a CTE in isolation.
2. CTEs provide a way of writing procedural-like SQL in a top-to-bottom
style, which can help with code review and maintainability.
3. CTEs can help conform to the DRY (don’t repeat yourself) principle,
providing a single place to define logic that is referenced multiple
times downstream.

How does Snowflake treat CTEs in the query


plan?
In order to understand the performance implications of CTEs, we first need
to understand how Snowflake handles CTE declarations in a query’s
execution.

Are CTEs pass-throughs?


Yes, so long as the CTE is referenced only once. By pass-through, we mean
that the query gets processed in the same way whether or not the CTE is
used. When a CTE is referenced only once, it’s always a pass-through, and
the query profile shows no sign of it whatsoever. Therefore, using a CTE
that’s referenced only once will never impact performance vs avoiding the
CTE.

with sample_data as (

select *

from snowflake_sample_data.tpch_sf1.customer

)
select *

from sample_data

where c_nationkey = 14
But if we reference that CTE more than once we see something different,
and the query’s execution is now different to if we’d referenced the table
directly rather than use a CTE.

with sample_data as (

select *

from snowflake_sample_data.tpch_sf1.customer

),
nation_14_customers as (

select *

from sample_data

where c_nationkey = 14

),

nation_9_customers as (

select *

from sample_data

where c_nationkey = 9

)
select *

from nation_14_customers

union all

select *

from nation_9_customers
We see two new node types, the WithClause and the WithReference .
The WithClause represents an output stream and buffer from
the sample_data CTE we defined, which is then consumed by
each WithReference node. Note that Snowflake is intelligently ‘pushing
down’ the filter in the nation_14_customers and nation_9_customers CTEs
to the TableScan before the WithClause. Previously, Snowflake didn’t do
this, as reported in Dominik’s post. It’s worth checking that this behavior
applies to more complex queries, but for this query, the profile is the same
as if we’d written the query like:
with sample_data as (

select *

from snowflake_sample_data.tpch_sf1.customer

where c_nationkey in (14, 9)

),

nation_14_customers as (

select *

from sample_data

where c_nationkey = 14

),

nation_9_customers as (

select *

from sample_data

where c_nationkey = 9

)
select *

from nation_14_customers

union all
select *

from nation_9_customers

Let's now replace the sample_data CTE references with a


direct snowflake_sample_data.tpch_sf1.customer table reference, and see
the differences in execution plan:

with nation_14_customers as (

select *

from snowflake_sample_data.tpch_sf1.customer

where c_nationkey = 14

),

nation_9_customers as (

select *

from snowflake_sample_data.tpch_sf1.customer

where c_nationkey = 9

)
select *

from nation_14_customers

union all

select *

from nation_9_customers
The differences are:

 Two TableScans instead of one. The TableScan on the left performs


the read from remote storage, and the TableScan on the right uses
the local warehouse-cached result of the one on the left. While there
are two TableScans, only one of them performs any remote data
fetching.
 Two Filters instead of three, though where a filter is applied after
a TableScan, the TableScan node itself takes care of the filtering,
which is why the input and output row counts of the filter are the
same.
 No WithClause or WithReference nodes.

Now we understand how CTEs are translated into an execution plan, let's
explore the performance implications.
Sometimes, it’s faster to repeat logic than re-use
a CTE
Most of the time, Snowflake’s strategy of computing a CTE's result once
and distributing the results downstream is the most performant strategy.
But in some circumstances, the cost of buffering and distributing the CTE
result to downstream nodes exceeds that of recomputing it, especially as
the TableScan nodes use cached results anyway.

Here’s a contrived example, referencing the lineitems CTE three times:

with lineitems as (

select *

from snowflake_sample_data.tpch_sf100.lineitem

where l_receiptdate > '1998-01-01'

),

lineitem_future_sales as (

select

a.l_orderkey,

a.l_linenumber,

sum(b.l_quantity) as future_part_order_total

from lineitems as a

left join lineitems as b

on a.l_partkey = b.l_partkey

and b.l_receiptdate > a.l_receiptdate


group by 1

)
select *

from lineitems

left join lineitem_future_sales on

lineitems.l_orderkey = lineitem_future_sales.l_orderkey

and lineitems.l_linenumber = lineitem_future_sales.l_linenumber

Across three runs, this query took an average of 1m 17s to complete on a


small warehouse. Here’s an example profile:
If we instead re-write the query to repeat the lineitems CTE as a subquery:

with lineitem_future_sales as (

select

a.l_orderkey,
a.l_linenumber,

sum(b.l_quantity) as future_part_order_total

from (

select *

from snowflake_sample_data.tpch_sf100.lineitem

where l_receiptdate > '1998-01-01'

) as a

left join (select *

from snowflake_sample_data.tpch_sf100.lineitem

where l_receiptdate > '1998-01-01'

) as b

on a.l_partkey = b.l_partkey

and b.l_receiptdate > a.l_receiptdate

group by 1

)
select *

from (

select *

from snowflake_sample_data.tpch_sf100.lineitem

where l_receiptdate > '1998-01-01'


) as lineitems

left join lineitem_future_sales on

lineitems.l_orderkey = lineitem_future_sales.l_orderkey

and lineitems.l_linenumber = lineitem_future_sales.l_linenumber

The query takes an average of 1m 7s across three runs, a roughly 10%


speed improvement. Query profile:

lineitems is a simple CTE. When a CTE reaches a certain level of complexity,


it’s going to be cheaper to calculate the CTE once and then pass its results
along to downstream references rather than re-compute it multiple times.
This behavior isn’t consistent though (as we saw with the basic example
in Are CTEs pass-throughs), so it’s best to experiment. Here’s a way to
visualize the relationship:

Recommendation
CTEs can be used with confidence in Snowflake, and a CTE that’s referenced
only once will never impact performance. Aside from some very specific
examples like the above, computing the CTE once and re-using it will yield
the best performance vs repeating the CTE's logic. In the previous section,
we’ve seen that Snowflake will intelligently push down filters into CTEs to
avoid unnecessary full table scans.

If, however, you’re working on optimizing a specific query where


performance and cost efficiency are paramount and it’s worth spending
time on, experiment by repeating the CTE’s logic. The CTE’s logic can either
be repeated in multiple subqueries, or it can be defined in a view and
referenced multiple times like the CTE was.
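
As a sketch of the view approach, reusing the lineitems filter from the example above (the view name is arbitrary and it would live in your own database, not the sample data share):

create view filtered_lineitems as
select *
from snowflake_sample_data.tpch_sf100.lineitem
where l_receiptdate > '1998-01-01';

-- downstream queries reference the view wherever the CTE was used,
-- and Snowflake plans each reference independently
select count(*)
from filtered_lineitems as a
left join filtered_lineitems as b
    on a.l_partkey = b.l_partkey
    and b.l_receiptdate > a.l_receiptdate;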

In some scenarios, CTEs prevent column pruning


In previous posts, we covered the unique design of Snowflake’s micro-
partitions and how they enable a powerful optimization called micro-
partition pruning. Due to their columnar storage format, they also enable
column pruning. This is important because it means that only the columns
selected in a query need to be retrieved over the network.

Column pruning always works when a CTE is referenced once (when CTEs
are referenced only once they are treated as if they don’t exist). In a simple
case:

with sample_data as (

select *

from snowflake_sample_data.tpch_sf1.customer

)
select c_name, c_address

from sample_data

We can see that only the two columns that were selected were read from
the underlying table. But as we know from before, a CTE which is
referenced only once is a pass-through, and is compiled into a query plan
agnostic of its existence.
Column pruning stops working when a CTE is referenced
more than once
This time, let’s reference the CTE twice, selecting a single column in each
CTE reference.

with sample_data as (

select *

from snowflake_sample_data.tpch_sf1.customer

),

customer_names as (

select c_name

from sample_data

),

customer_addresses as (

select c_address

from sample_data

)
select c_name

from customer_names

union all
select c_address

from customer_addresses

Snowflake unfortunately has not pushed down the column references to


the underlying table scan. Here’s the full query profile:
Let’s try it again, this time using direct table references.

with customer_names as (

select c_name

from snowflake_sample_data.tpch_sf1.customer

),

customer_addresses as (

select c_address

from snowflake_sample_data.tpch_sf1.customer

)
select c_name

from customer_names

union all

select c_address

from customer_addresses
As expected, we have two TableScan nodes and each retrieves only the
referenced columns.

Column pruning breaks down with wildcards and joins


Another place where Snowflake may not perform column pruning push-
down is with joins (thanks to Paul Vernon for noticing this one).
The TableScan of the nation table should ideally only retrieve
the n_nationkey and n_name columns, but instead retrieves them all.

with nations as (

select *

from snowflake_sample_data.tpch_sf1.nation

),

joined as (

select *

from snowflake_sample_data.tpch_sf1.customer

left join nations

on customer.c_nationkey = nations.n_nationkey

)
select c_address, n_name from joined


Recommendation
We recommend that column references are listed explicitly where CTEs are
used to ensure that TableScans retrieve only the required columns. Though,
if a query runs fast enough, the maintenance trade-off of listing out the
columns explicitly might not be worth it.

Correspondingly, we recommend against the select * from table CTEs


used in dbt's style guide. Instead, reference the needed table directly to
ensure column pruning.

So, should you use CTEs in Snowflake?


For almost all cases, yes. If your query runs fast enough and there aren’t
cost concerns, go ahead. It’s important not to optimize unnecessarily, as the
time and opportunity cost taken to do so can outweigh the benefits.

If you’re working on optimizing a particular query that uses CTEs though,


check the following:

1. Is a simple CTE referenced more than once? If a CTE doesn’t perform


much work, then the overhead
of WithClause and WithReference nodes may exceed simply repeating
the CTE's calculation using either subqueries or a view.
2. Are column references being pushed down and pruned as expected
in TableScan nodes? If not, try listing out the required columns as
early in the query as possible. This may improve the speed of
the TableScan node considerably for wide tables.

Identifying and actioning optimization opportunities is time-consuming.


SELECT makes it easy, automatically surfacing optimizations like those in
this post. Get automated savings from day 1, quickly identify cost centers
and optimize your Snowflake workloads. Get access today or book a demo
using the links below.
Controlling Costs with Snowflake
Resource Monitors
Date
Saturday, March 18, 2023

Ian Whitestone

Co-founder & CEO of SELECT

Resource monitors are a feature offered by Snowflake to help customers


control their spend. In this post we cover everything you need to know
about how to configure and best utilize resource monitors in your account.
To start things off, let's cover why you should use resource monitors in the
first place.

What are Snowflake resource monitors for?


Snowflake's usage-based pricing model and ease of scaling allow
customers to pay for just the resources they need while quickly scaling to
meet demand. The downside is that Snowflake costs can rise unexpectedly
when left unchecked or without proper controls in place. Resource
monitors help users monitor and control costs by sending notifications at
defined spend thresholds. Resource monitors can also be configured to
help avoid early contract renewal by mitigating unexpected credit usage.

What do Snowflake resource monitors do?


Resource monitors allow users to specify a Snowflake credit limit for a given
resource and timeframe. For example, a warehouse or grouping of
warehouses can be assigned a daily credit limit of 50 credits per day. When
the specified percentage of that limit is reached in the given timeframe,
resource monitors can trigger different actions. The three possible actions
are:

1. Notify the user via email


2. Suspend the warehouse(s) after all queries are finished running
3. Immediately suspend the warehouse(s) and abort any running
queries

Outside of warehouses, resource monitors can be configured for an entire


account.

The screenshot below shows an example of an email notification sent by a


resource monitor.
How do you create a resource monitor in
Snowflake?
Resource monitors can be created through the Snowflake UI or
programmatically with SQL. Let's go through each approach.

Creating a resource monitor in the UI


To create a new resource monitor in the Snowflake UI, navigate to the
Admin > Resource Monitors section. After clicking the "+ Resource
Monitor" button in the top right, you'll be prompted to fill out the following
details:
1. Name: Set this to anything you like, but note that spaces are not
allowed
2. Credit Quota: The desired credit allowance for the monitor
3. Monitor Type: Either "Account" or "Warehouse". If "Account", the
credit quota applies to the entire account. If "Warehouse", the quota
only applies to the warehouses you specify.
 Warehouse(s): If you chose "Warehouse" as your monitor type, then
you will be prompted to select 1 or more warehouses to associate
with the monitor.
4. Schedule: Here, the start time and (optional) end time for when the
resource monitor operates can be configured. The frequency at which
the resource monitor resets its credit count can also be set.
Supported options for this include: daily, monthly, yearly and never.
5. Actions: At least one action must be configured to create the
monitor.

Creating a resource monitor with SQL


An example command to create a resource monitor with SQL is shown
below:

create or replace resource monitor compute_warehouse_monitor with

credit_quota=5
frequency=daily

start_timestamp=immediately

triggers

on 75 percent do notify

on 100 percent do suspend

on 110 percent do suspend_immediate

To assign the monitor to the COMPUTE warehouse, run the following:

alter warehouse compute set resource_monitor = compute_warehouse_monitor

The full syntax for this command can be found in the Snowflake
documentation.

What resource monitors should you consider


creating?
Consider creating an account-level resource monitor based on your
Snowflake contract amount. This can help ensure you don't finish your
contract budget earlier than expected.

For example, if the remaining Snowflake contract balance is $240,000 over


the next 6 months, then you could set a monthly account resource monitor
with a credit quota of 16,000 credits (assuming a credit price of $2.5/credit).
You could choose to get notified by email if the account's credit usage
exceeds 100% of that quota.
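
Following the same syntax as earlier, that monitor might look something like this (the monitor name is arbitrary):

create or replace resource monitor monthly_contract_monitor with
    credit_quota=16000
    frequency=monthly
    start_timestamp=immediately
    triggers
        on 100 percent do notify;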

Another recommendation is to create a resource monitor to alert you of


any credit usage spikes. To accomplish this, you could create a resource
monitor to send an alert whenever daily spend exceeds 1.5x the average
daily amount. We recommend adjusting this threshold multiple as needed
to avoid being alerted too frequently.
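
One way to pick the daily quota is to look at recent average daily credit usage; a sketch using the account usage warehouse metering history over the last 30 days:

select
    1.5 * avg(daily_credits) as suggested_daily_credit_quota
from (
    select
        date_trunc('day', start_time) as usage_date,
        sum(credits_used) as daily_credits
    from snowflake.account_usage.warehouse_metering_history
    where start_time > current_date - 30
    group by 1
);
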
How do you view resource monitors?
Resource monitors can be viewed in the Snowflake UI under Admin ->
Resource Monitors:

Alternatively, you can run show resource monitors as a SQL statement:


Who can create or access resource monitors?
Resource monitors can only be created by account administrators. Once a
resource monitor is created, roles can be granted access to view or
modify a monitor. Access has to be granted to each resource monitor
individually.

Any user in Snowflake can receive email notifications from resource
monitors. This is accomplished by setting the NOTIFY_USERS parameter
on the resource monitor with SQL. See the Snowflake documentation for
more details.
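
As a rough sketch of both the grants and the notification list (the monitor, role, and user names below are placeholders):

grant monitor, modify on resource monitor compute_warehouse_monitor to role data_platform_admin;

alter resource monitor compute_warehouse_monitor set notify_users = ('SHAUNA', 'NIALL');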

What are the downsides of Snowflake resource


monitors?
While resource monitors are a great starting point for controlling Snowflake
credit usage and avoiding unexpected charges, there are a few downsides
as of today.

First, you can only set resource monitors at the account or warehouse level.
Serverless features such as automatic clustering or materialized views can
also incur sudden charges but are not supported by resource monitors.

Resource monitors only support notifications via email. Most teams prefer
to receive notifications directly in Slack or Teams along with the rest of their
alerts and team communication.
Additionally, the email notifications contain little information about what
caused the credit allowance to be exceeded. Users must then dig through
their account usage data to determine what caused the spike.

If you'd like a richer experience, we recommend using SELECT, which offers
these more advanced features out of the box with minimal set-up effort.

Troubleshooting no email from resource monitor


If you are not receiving an email from a resource monitor, it may be
because notifications haven't been enabled for the Snowflake admin.
See this page for instructions on how to accomplish that.
How to use Query Timeouts in Snowflake
Date
Sunday, March 19, 2023

Ian Whitestone

Co-founder & CEO of SELECT

Query timeouts are an important tool for Snowflake users looking to
control costs and prevent accidental cost spikes. In this post we’ll cover why
they’re useful and how they can be configured.

What are Snowflake query timeouts?


Query timeouts are a setting that prevents Snowflake queries from running
for too long. If a query runs for longer than the timeout setting, the query is
automatically cancelled by Snowflake.
Why use a timeout?
In Snowflake, queries run on virtual warehouses and Snowflake charges for
each second a warehouse is resumed, or “active”. For example, if someone
runs an inefficient query on a large dataset and it takes 2 days to complete
on an extra-large warehouse, that single query will cost the customer nearly
$2,000.

By lowering the query timeout setting, Snowflake customers can control
their costs and avoid unexpected charges from long running queries.

What is the default query timeout in Snowflake?


The default query timeout in Snowflake is 2 days (172,800 seconds) for
sessions and warehouses (source). Due to this long default, most Snowflake
customers should consider updating their query timeouts to shorter times.

How to configure query timeouts in Snowflake


Query timeouts can be set in a few different ways: account wide, per user,
for a single session, or for a given warehouse. All methods use
the STATEMENT_TIMEOUT_IN_SECONDS parameter.

Session Query Timeout


To check the existing query timeout setting for the current session, run the
command below:

show parameters for session

To set the query timeout to 1 hour for the current session, run the
command below:

alter session set statement_timeout_in_seconds = 3600

User Query Timeout


To see the existing query timeout setting for a given user (analytics_user),
run the command below:

show parameters for user analytics_user

To set the query timeout to 1 hour for a given user, run the command
below:

alter user analytics_user set statement_timeout_in_seconds = 3600

Account Wide Timeout


To see the existing query timeout setting for the account, run the command
below:

show parameters for account

It will likely be set to the default 2 days (172,800). To drop this down to 1
day, run the command below:

alter account set statement_timeout_in_seconds = 86400

Warehouse Query Timeout


To check the current query timeout setting for an existing warehouse, run
the command below. In this example, the warehouse is named COMPUTE.
show parameters for warehouse compute

To change the query timeout to 1 hour for the warehouse, run the
command below:

alter warehouse compute set statement_timeout_in_seconds = 3600

The timeout setting can also be configured when creating a new
warehouse:

create warehouse compute
    warehouse_size = 'XSMALL'
    statement_timeout_in_seconds = 3600

Which query timeout is used when multiple are set?


When multiple query timeout settings are present, Snowflake enforces the
lowest setting. For example, if a session has a query timeout setting of 1
hour and a warehouse has a timeout setting of 10 minutes, then Snowflake
will cancel any query that runs for longer than 10 minutes.
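
For instance, with both of the settings below in place, a query running on the COMPUTE warehouse in that session would be cancelled after 10 minutes, since the warehouse-level value is the lower of the two:

alter session set statement_timeout_in_seconds = 3600;    -- 1 hour
alter warehouse compute set statement_timeout_in_seconds = 600;  -- 10 minutes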

How to configure task timeouts in Snowflake


Timeouts for Snowflake tasks are configured using
the USER_TASK_TIMEOUT_MS parameter. The default timeout is 1 hour (source).
Note that the unit is milliseconds as opposed to seconds for query
timeouts.

Similar to warehouses and sessions, you can check the current timeout
setting by running show parameters for task my_task . To change the task
timeout to 60 seconds, run alter task my_task set user_task_timeout_ms = 60000 .
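
The parameter can also be set when the task is created. A minimal sketch is shown below; the task name, warehouse, schedule and query are placeholders:

create or replace task my_task
    warehouse = compute
    schedule = '60 minute'
    user_task_timeout_ms = 60000  -- cancel the task's SQL statement after 60 seconds
as
    select 1;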
Are you billed for queries and tasks cancelled by
Snowflake timeouts?
Yes. Snowflake charges customers for each second a virtual warehouse is
active. If a query runs on a virtual warehouse for 4 hours before it is
cancelled by Snowflake, customers will be charged for the 4 hours that
warehouse was active.
Identifying unused tables in Snowflake
Date
Monday, March 20, 2023

Ian Whitestone

Co-founder & CEO of SELECT

While Snowflake storage costs tend to be a small portion of overall
Snowflake spend, many customers have a significant number of unused
tables in their accounts incurring unnecessary charges. If a dataset isn’t
being used, adding value to the business or required by law to be stored, it
should be removed.

Removing unused datasets can be a quick win for teams looking to reduce
their Snowflake spend. It can also improve security and reduce risks
associated with data breaches and data exposure. The less data you store,
the smaller the footprint for unintended access.

Lastly, deleting unused tables can improve overall data warehouse usability.
Unused datasets often contain data that is stale or not meant to be
accessed, so removing these tables can help avoid confusion or reporting
errors.

In this post we’ll cover how to identify unused tables in Snowflake using
the access_history account usage view.
Skip to the final SQL?
If you want to skip ahead and see the final SQL implementation, you can
head straight to the end!

Snowflake Access History View


Access History is a view in the Account Usage schema of the Snowflake
Database. It is available for all Snowflake accounts on Enterprise Edition or
higher. Access History can be used to look up the Snowflake objects (i.e.
tables, views, and columns) accessed by each query, either directly or
indirectly.

Direct versus Base Objects Accessed


To determine which objects were accessed by a query, there are two
columns of interest: direct_objects_accessed and base_objects_accessed . A key
difference between the two columns comes from how they handle views.
Consider the following view definition:

create or replace view orders_view as (
    select *
    from orders
    where
        not test
        and success
);

The query select * from orders_view directly accesses the orders_view object,
and indirectly accesses the base orders table.
Correspondingly, orders_view will appear in the direct_objects_accessed column
of access_history, whereas orders will appear in base_objects_accessed .
When it comes to deciding if a table is unused, it’s important to
use base_objects_accessed since this will account for queries that indirectly
access a table through a view.
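
To see this difference for yourself, you can compare the two columns side by side for recent queries. This is just a quick exploratory query, separate from the final SQL below:

select
    query_id,
    query_start_time,
    direct_objects_accessed,
    base_objects_accessed
from snowflake.account_usage.access_history
order by query_start_time desc
limit 10;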

Parsing base_objects_accessed
base_objects_accessed is a JSON array of all base data objects accessed
during query execution. Here’s an example of the column’s contents from
the documentation:

"columns": [

"columnId": 68610,

"columnName": "CONTENT"

],

"objectDomain": "Table",

"objectId": 66564,

"objectName": "GOVERNANCE.TABLES.T1"

The array of objects accessed by each query can be transformed to one row
per object via lateral flatten and then filtered to only consider table
objects, as shown below:
with
access_history as (
    select *
    from snowflake.account_usage.access_history
),
access_history_flattened as (
    select
        access_history.query_id,
        access_history.query_start_time,
        access_history.user_name,
        objects_accessed.value:objectId::integer as table_id,
        objects_accessed.value:objectName::text as object_name,
        objects_accessed.value:objectDomain::text as object_domain,
        objects_accessed.value:columns as columns_array
    from access_history, lateral flatten(access_history.base_objects_accessed) as objects_accessed
),
table_access_history as (
    select
        query_id,
        query_start_time,
        user_name,
        object_name as fully_qualified_table_name
    from access_history_flattened
    where
        object_domain = 'Table' -- removes secured views
        and table_id is not null -- removes tables from a data share
)
select *
from table_access_history

Find when a table was last queried/accessed


Using the “flattened” access_history from the query above, we can determine
the exact time a table was last accessed along with the user who ran the
query:

with
access_history as (
    select *
    from snowflake.account_usage.access_history
),
access_history_flattened as (
    select
        access_history.query_id,
        access_history.query_start_time,
        access_history.user_name,
        objects_accessed.value:objectId::integer as table_id,
        objects_accessed.value:objectName::text as object_name,
        objects_accessed.value:objectDomain::text as object_domain,
        objects_accessed.value:columns as columns_array
    from access_history, lateral flatten(access_history.base_objects_accessed) as objects_accessed
),
table_access_history as (
    select
        query_id,
        query_start_time,
        user_name,
        object_name as fully_qualified_table_name
    from access_history_flattened
    where
        object_domain = 'Table' -- removes secured views
        and table_id is not null -- removes tables from a data share
)
select
    fully_qualified_table_name,
    max(query_start_time) as last_accessed_at,
    max_by(user_name, query_start_time) as last_accessed_by,
    max_by(query_id, query_start_time) as last_query_id
from table_access_history
group by 1

Calculate table storage costs


When identifying unused tables to delete, it’s helpful to see the associated
storage costs. Using the table_storage_metrics account usage view and an
assumed storage rate of $23 per terabyte per month, the annual storage
cost of each table can be calculated:

select
    id as table_id,
    table_catalog || '.' || table_schema || '.' || table_name as fully_qualified_table_name,
    (active_bytes + time_travel_bytes + failsafe_bytes + retained_for_clone_bytes) / power(1024, 4) as total_storage_tb,
    -- Assumes a storage rate of $23/TB/month
    -- Update to the appropriate value based on your Snowflake contract
    total_storage_tb * 12 * 23 as annualized_storage_cost
from snowflake.account_usage.table_storage_metrics
where
    not deleted
Identify all tables not queried in the last X days
So far we’ve covered how to determine when a table was last
accessed and the storage costs associated with each table. We can tie
these building blocks together to identify all tables not queried in the last
90 days and show the annual savings that could be expected if the tables
were deleted.

Standard Edition not supported!


The SQL below relies on the account_usage.access_history view which is only
available for Snowflake customers on Enterprise Edition and higher.
Using dbt?
If you are using dbt, consider the alternative version of this SQL which
runs much faster.

with
access_history as (
    select *
    from snowflake.account_usage.access_history
),
access_history_flattened as (
    select
        access_history.query_id,
        access_history.query_start_time,
        access_history.user_name,
        objects_accessed.value:objectId::integer as table_id,
        objects_accessed.value:objectName::text as object_name,
        objects_accessed.value:objectDomain::text as object_domain,
        objects_accessed.value:columns as columns_array
    from access_history, lateral flatten(access_history.base_objects_accessed) as objects_accessed
),
table_access_history as (
    select
        query_id,
        query_start_time,
        user_name,
        object_name as fully_qualified_table_name,
        table_id
    from access_history_flattened
    where
        object_domain = 'Table' -- removes secured views
        and table_id is not null -- removes tables from a data share
),
table_access_summary as (
    select
        table_id,
        max(query_start_time) as last_accessed_at,
        max_by(user_name, query_start_time) as last_accessed_by,
        max_by(query_id, query_start_time) as last_query_id
    from table_access_history
    group by 1
),
table_storage_metrics as (
    select
        id as table_id,
        table_catalog || '.' || table_schema || '.' || table_name as fully_qualified_table_name,
        (active_bytes + time_travel_bytes + failsafe_bytes + retained_for_clone_bytes) / power(1024, 4) as total_storage_tb,
        -- Assumes a storage rate of $23/TB/month
        -- Update to the appropriate value based on your Snowflake contract
        total_storage_tb * 12 * 23 as annualized_storage_cost
    from snowflake.account_usage.table_storage_metrics
    where
        not deleted
)
select
    table_storage_metrics.*,
    table_access_summary.* exclude (table_id)
from table_storage_metrics
inner join table_access_summary
    on table_storage_metrics.table_id = table_access_summary.table_id
where
    last_accessed_at < (current_date - 90) -- Modify as needed
order by table_storage_metrics.annualized_storage_cost desc

Identifying unused tables with dbt
Querying and flattening the access_history view can be very slow due to the
volume of data that must be processed. For faster queries about table
access history, we recommend incrementally materializing this data using
our open-source dbt package: dbt_snowflake_monitoring. Once the
package is installed, queries to identify unused tables become much
simpler. The code from above can be re-written as:

with
table_access_summary as (
    select
        table_id,
        max(query_start_time) as last_accessed_at,
        max_by(user_name, query_start_time) as last_accessed_by,
        max_by(query_id, query_start_time) as last_query_id
    from query_base_table_access
    group by 1
),
table_storage_metrics as (
    select
        id as table_id,
        table_catalog || '.' || table_schema || '.' || table_name as fully_qualified_table_name,
        (active_bytes + time_travel_bytes + failsafe_bytes + retained_for_clone_bytes) / power(1024, 4) as total_storage_tb,
        -- Assumes a storage rate of $23/TB/month
        -- Update to the appropriate value based on your Snowflake contract
        total_storage_tb * 12 * 23 as annualized_storage_cost
    from snowflake.account_usage.table_storage_metrics
    where
        not deleted
)
select
    table_storage_metrics.*,
    table_access_summary.* exclude (table_id)
from table_storage_metrics
inner join table_access_summary
    on table_storage_metrics.table_id = table_access_summary.table_id
where
    last_accessed_at < (current_date - 90) -- Modify as needed
order by table_storage_metrics.annualized_storage_cost desc

Find when a table was last updated


As part of deciding whether to delete a table, it can be helpful to know
when the table was last updated by a DDL or DML operation. The query
below shows how to find all tables that were updated in the past week by
using the tables account usage view:

select
    table_id,
    table_catalog || '.' || table_schema || '.' || table_name as fully_qualified_table_name,
    last_altered as last_altered_at
from snowflake.account_usage.tables
where
    last_altered > current_date - 7

Wrapping Up
Removing unused tables represents one of the many cost saving
opportunities available to Snowflake users. In addition to surfacing table
access patterns, SELECT automatically produces a variety of other
optimization recommendations. Get access today or book a demo using the
links below.
Snowflake Pricing Explained | 2024 Billing
Model Guide
Date
Wednesday, March 6, 2024

Niall Woodward
Co-founder & CTO of SELECT

Ian Whitestone

Co-founder & CEO of SELECT

Snowflake Overview
Snowflake is a data cloud platform used by organizations to store, process
and analyse data. Snowflake uses the big three cloud providers for hosting
- AWS (Amazon Web Services), GCP (Google Cloud Platform), and Microsoft
Azure. Snowflake is a fully-managed platform, and users have no direct
access to the underlying infrastructure. This is owed to Snowflake’s goal of
making the platform straightforward to use by managing complexity while
still providing a powerful feature set.

Snowflake’s stand out features include decoupled storage and compute
layers, just-in-time provisioning, and automatic suspension of unused
compute instances (known as virtual warehouses). The decoupled storage
layer enables features such as zero-copy cloning and data sharing.
Snowflake’s Pricing Model
Like most cloud SaaS (Software as a Service) platforms, Snowflake utilizes
usage based pricing. Rather than a fixed monthly or annual fee, Snowflake
tracks usage volumes across computation, data storage and transfer,
calculating costs from pre-determined rates for each.

Snowflake has a currency called ‘credits’. Snowflake Credits are consumed
by performing activities within the platform - running virtual warehouses
etc. The cost of each credit depends on three main factors - Snowflake
edition, hosting location, and cloud provider.

Snowflake Editions
You can think of Snowflake Editions as different plans Snowflake offers.

Snowflake currently offers four editions: Standard, Enterprise, Business
Critical, and Virtual Private Snowflake (VPS). Each Snowflake Edition is
differentiated by the availability of certain features. The primary
differentiators are that Enterprise customers gain multi-cluster warehouses
(horizontal scaling), with the Business Critical and VPS editions focusing almost
entirely on increased security and data protection. For full details on the
features offered between each edition, see the Snowflake documentation.

Snowflake Regions
An individual Snowflake account runs in a single region. Customers are free
to create as many accounts as they wish across both cloud providers and
regions. Snowflake has over 35 regions to choose from. Customers typically
choose the same region and cloud provider as their existing infrastructure
for their primary account.

Advantages of Snowflake pricing


Snowflake’s usage-based pricing model means you only pay for the
resources you consume. This fact combined with Snowflake’s speed of
provisioning (typically less than a second), means that customers can use
automatic scaling to switch off unused virtual warehouses without
experiencing performance issues when they’re needed again.

Disadvantages of Snowflake pricing


The main disadvantage with Snowflake’s pricing model, like all usage-based
pricing models, is the variability. It is difficult to produce accurate cost
estimates prior to adopting Snowflake, and costs vary with usage. This,
combined with the freedom of users to potentially spend a lot of money by
using over-sized virtual warehouses, means it’s essential to have robust
monitoring and budgeting practices in place.

Is Snowflake cost effective?


Snowflake is extremely cost effective, provided the right practices are in
place. We recommend implementing robust monitoring practices to track
your usage using either Snowflake’s built-in reporting in the UI, creating
custom dashboards, or using a product like SELECT.

Snowflake Credit Pricing


As mentioned earlier, the cost of each Snowflake Credit depends on four
factors:

1. Snowflake edition
2. Hosting region
3. Cloud provider
4. Discounts

Here is the range of per-credit pricing across each edition, using on-
demand payment terms, where usage is invoiced every month. The lower
value of each range represents the typical US AWS regions used by most
customers, with the upper values in regions outside of the USA.

 Standard: $2.00 - $3.10
 Enterprise: $3.00 - $4.65
 Business Critical: $4.00 - $6.20
 VPS (Virtual Private Snowflake): $6.00 - $9.30

Most customers opt to pay for Snowflake using capacity commitment
contracts (capacity pricing), where usage is pre-purchased upfront. Opting
for a capacity commitment contract provides discounts on the per-credit
prices on a sliding scale.

For a comprehensive table of Snowflake credit costs across all clouds and
regions, see the Snowflake Credit Consumption Table.

Pricing for Each Snowflake Service


Virtual Warehouse Pricing
Virtual warehouses are typically the primary source of cost, as they are used
to run queries. Customers will often refer to these as "compute costs". The
price of Snowflake warehouses varies based on their size. Here's a list of
each virtual warehouse size as well as the associated costs:

Warehouse Size | Credits / Hour | Snowpark-Optimized Credits / Hour
X-Small | 1 | N/A
Small | 2 | N/A
Medium | 4 | 6
Large | 8 | 12
X-Large | 16 | 24
2X-Large | 32 | 48
3X-Large | 64 | 96
4X-Large | 128 | 192
5X-Large | 256 | 384
6X-Large | 512 | 768

Each warehouse size increment doubles the resources available. Snowpark-
optimized warehouses are a newer warehouse type, with 16x the memory
of the ‘normal’ warehouse type for each size, at 1.5x the cost.

Virtual warehouse compute costs typically make up 80% of a Snowflake
customer's bill. As a result, they are often the focus of any cost
optimization efforts.
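
If you want a quick view of where your warehouse spend is going, a query along these lines against the warehouse_metering_history account usage view can help. The $3.00 credit price here is just an assumption - substitute your own contracted rate:

select
    warehouse_name,
    round(sum(credits_used), 1) as credits_last_30d,
    round(sum(credits_used) * 3.00, 2) as approx_spend_usd  -- assumes $3.00 per credit
from snowflake.account_usage.warehouse_metering_history
where start_time >= dateadd('day', -30, current_date)
group by 1
order by credits_last_30d desc;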

Serverless Pricing
For Snowflake’s serverless features like Snowpipe, Automatic
Clustering and Serverless tasks, credits are consumed using a multiplier
specific to each feature. The cheapest services from a credits per hour
perspective are Query Acceleration and Snowpipe Streaming, which both
incur a cost of 1 compute credit per hour. The most expensive feature is the
Search Optimization Service, which incurs a cost of 10 compute credits per
hour.

Feature | Compute Credits per Hour | Cloud Services Credits per Hour
Clustered tables | 2 | 1
Copy Files | 2 | N/A
Logging | 1.25 | 1
Materialized views maintenance | 10 | 5
Materialized views maintenance in secondary databases | 2 | 1
Query acceleration | 1 | 1
Replication | 2 | 1
Search optimization service | 10 | 5
Search optimization service in secondary databases | 2 | 1
Serverless tasks | 1.5 | 1
Snowpipe | 1.25 | N/A (charged 0.06 Snowflake Credits per 1,000 files instead)
Snowpipe Streaming | 1 | N/A (charged at an hourly rate per client instance instead)
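
To see how much each serverless feature is actually costing you, one option is to group the metering_history account usage view by service type, along these lines:

select
    service_type,  -- e.g. WAREHOUSE_METERING, AUTO_CLUSTERING, PIPE, MATERIALIZED_VIEW
    round(sum(credits_used), 1) as credits_last_30d
from snowflake.account_usage.metering_history
where start_time >= dateadd('day', -30, current_date)
group by 1
order by credits_last_30d desc;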

Storage Pricing
Snowflake stores data in a proprietary file format called micro-partitions in
the cloud storage service of the underlying cloud provider (i.e. Amazon S3,
Azure Blob Storage, Google Cloud Storage).

Snowflake uses direct dollar pricing for storage. Prices again vary
depending on cloud provider and region. Customers on AWS USA regions
pay $23 per TB per month. Regions outside of the USA again are more
expensive. A comprehensive breakdown of all storage costs across all
regions can be found on the Snowflake website.

If you're looking for a deep dive into Snowflake storage costs, be sure to
check out our post here. For quick wins on reducing unnecessary storage
costs, you can query your account to identify any unused tables and
remove them.

Data Transfer Costs


Data transfer is the process of moving data into and out of Snowflake. Data
moving into Snowflake is often referred to as "ingress", whereas data
moving out of Snowflake is referred to as "egress". Here are some
important notes on how data transfer costs work in Snowflake:

1. Snowflake does not charge for data ingress.
2. It is free of charge to transfer data within the same region and
cloud provider.
3. Only specific Snowflake features incur data transfer costs (unloading
data, data replication, using external functions, etc. - see here).
4. Data egress charges do not apply when a Snowflake client or driver
retrieves query results, even if those happen across cloud platforms
or regions!

Again, the full costs can be found on the Snowflake website.
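
To check whether your account is incurring any egress, you can inspect the data_transfer_history account usage view, for example:

select
    target_cloud,
    target_region,
    transfer_type,
    sum(bytes_transferred) / power(1024, 4) as tb_transferred_last_30d
from snowflake.account_usage.data_transfer_history
where start_time >= dateadd('day', -30, current_date)
group by 1, 2, 3
order by tb_transferred_last_30d desc;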

Cloud Services Costs


Snowflake’s cloud services layer is responsible for everything that isn’t the
actual storing and processing of data. That includes authentication, query
compilation, and zero-copy cloning to name a few. Snowflake’s pricing for
cloud services uses a fair-use style model, where so long as cloud services
usage doesn’t exceed 10% of compute usage, no additional costs are
incurred. For example, if a customer uses 100 compute credits, and 5 cloud
services credits, they are canceled out:
Service | Credits
Compute | 100
Cloud Services | 5
Cloud Services Rebate | -5
Total | 100

If the cloud services credits increase to 15 however:

Service | Credits
Compute | 100
Cloud Services | 15
Cloud Services Rebate | -10
Total | 105

Only 10 credits are rebated, calculated as 10% of the compute credits used.
Therefore, the customer is charged for 5 cloud services credits.

Most customers never pay for cloud services due to this 10% policy.
Scenarios where this isn’t the case are typically where a large number of
simple queries are being executed, as these have a high cloud services cost
relative to their compute costs.
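
You can check how this plays out on your own account with the metering_daily_history account usage view, which breaks out the cloud services adjustment explicitly:

select
    usage_date,
    sum(credits_used_compute) as compute_credits,
    sum(credits_used_cloud_services) as cloud_services_credits,
    sum(credits_adjustment_cloud_services) as cloud_services_rebate,  -- reported as a negative value
    sum(credits_billed) as credits_billed
from snowflake.account_usage.metering_daily_history
where usage_date >= dateadd('day', -30, current_date)
group by 1
order by 1 desc;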

Snowpark Container Services Pricing


Snowpark Container Services (SPCS) is a new, fully managed container
offering from Snowflake. As of February 2024, it is currently in public
preview in a select number of AWS regions. SPCS allows Snowflake
customers to run containerized workloads directly in Snowflake. You can
learn more about the service in the Snowflake documentation.

SPCS runs on top of Compute Pools, which are different than virtual
warehouses. The credits per hour for each type of compute are shown
below:
Compute Node Type | XS | S | M
CPU | 0.11 | 0.22 | 0.43
High-Memory CPU | N/A | 0.56 | 2.22
GPU | N/A | 1.14 | 5.366

A detailed breakdown of each compute node type can be found below:

INSTANCE_FAMILY | vCPU | Memory (GiB) | Storage (GiB) | GPU | GPU Memory per GPU (GiB) | Max. Limit
CPU - XS | 2 | 8 | 250 | n/a | n/a | 50
CPU - S | 4 | 16 | 250 | n/a | n/a | 50
CPU - M | 8 | 32 | 250 | n/a | n/a | 20
CPU - L | 32 | 128 | 250 | n/a | n/a | 20
High-Memory CPU - S | 8 | 64 | 250 | n/a | n/a | 20
High-Memory CPU - M | 32 | 256 | 250 | n/a | n/a | 20
High-Memory CPU - L | 128 | 1024 | 250 | n/a | n/a | 20
GPU - S | 8 | 32 | 250 | 1 NVIDIA A10G | 24 | 10
GPU - M | 48 | 192 | 250 | 4 NVIDIA A10G | 24 | 5
GPU - L | 192 | 2048 | 250 | 8 NVIDIA A100 | 40 | On request

Snowflake Pricing Example


Let’s work through a realistic example of Snowflake’s pricing across a
month. Suppose we run a data platform for a relatively small organization,
with data loading jobs, transformations, and a BI tool querying the
transformed data. We also have some Snowpipes doing ingestion from S3,
and have a couple of tables with automatic clustering turned on to keep
where predicates performant on a date column. Across the account, we
have 5TB of storage, and are running the Snowflake account in AWS US
East 1 on Enterprise edition. Each credit costs $3.
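
The exact bill depends on how many credits each of those workloads consumes. As a purely illustrative sketch, if the warehouses used around 800 credits over the month, Snowpipe around 50, and automatic clustering around 30 (all assumed figures, not benchmarks), the month could be estimated as follows:

select
    800 * 3.00 as warehouse_cost_usd,          -- assumed 800 warehouse credits at $3/credit
    50 * 3.00 as snowpipe_cost_usd,            -- assumed 50 Snowpipe credits
    30 * 3.00 as auto_clustering_cost_usd,     -- assumed 30 automatic clustering credits
    5 * 23 as storage_cost_usd,                -- 5 TB at $23/TB/month
    warehouse_cost_usd + snowpipe_cost_usd
        + auto_clustering_cost_usd + storage_cost_usd as total_monthly_cost_usd;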
Frequently Asked Questions
Does Snowflake offer a free plan or free trial?
Yes, a free trial of Snowflake with $400 of free credits can be created here.

What is the minimum billing fee for Snowflake?


Snowflake has no minimum billing requirement for on-demand pricing. The
minimum amount for a capacity commitment contract is $25,000.

How do Snowflake Capacity Contracts work?


For any customers spending over (or close to) $25,000 / year on Snowflake,
it makes sense to consider signing an annual capacity commitment contract
as Snowflake will give you discounts and a dedicated account manager. The
biggest discount will be for Storage, where the pricing drops from the on-
demand rate of $40/TB to $23/TB.

With Snowflake contracts, you pay upfront for a pre-committed capacity
amount. As you use Snowflake and consume Snowflake credits, Snowflake
will deduct those fees from your available balance. In the event you use
more than your pre-committed amount, you can purchase additional
capacity. If you use less than the capacity you purchased, you can roll over your
unused capacity to your next contract if the next contract you sign is for the
same amount or higher.

Due to this nature, it's important to ensure you don't over-commit on
unneeded capacity, since you can end up in a situation where you lose all
your unused capacity in order to renew at a lower rate and avoid
continuing to over-commit.

Does Snowflake offer discounts?


Snowflake offers discounts on capacity commitment contracts, which
increase with the amount of capacity purchased as well as your contract
length (i.e. 1 year vs. 3 year). The discount tiers and amounts are not
publicized.
BigQuery vs. Snowflake pricing?
Google BigQuery has two pricing models, on-demand, and capacity.
Despite sharing the same names with Snowflake’s pricing models, they are
very different. In on-demand, BigQuery charges for the data scanned per
query. In the capacity model, BigQuery charges per slot/hour, where a slot
is a unit of compute. This is very similar to Snowflake’s pricing model, and
shares the same per-second billing increment and 1 minute minimum
charge. Like Snowflake, BigQuery has several editions to choose from with
different feature availability.

Databricks vs. Snowflake pricing?


Databricks differs from BigQuery and Snowflake in that Databricks runs
workloads on compute instances that you pay for in your own cloud
account. Consequently, it incurs costs both to Databricks directly, and in
your cloud account. Databricks also offers a Serverless SQL model where
the compute instances are managed by Databricks. For this service, costs
are only paid to Databricks. This is more closely aligned with Snowflake and
BigQuery’s pricing and operating models.

Redshift vs. Snowflake pricing?


Amazon Redshift has two operating and pricing models available: DC2 and
RA3. DC2 is the more traditional data warehouse deployment model, where
compute instances and local storage are bound together. RA3 on the other
hand separates storage and computation. The advantage of this model is
that surplus capacity is minimized, as compute and storage can
independently scale to match the needs of the customer. RA3 is the more
similar option to Snowflake.

Is Snowflake expensive?
There is a widespread perception that Snowflake is expensive, and if managed
improperly, it can be. Customers not choosing the right warehouse
size or creating too many warehouses without proper controls in place can
often be a culprit of runaway costs. However, all usage-based cloud
platforms get expensive when not used thoughtfully; this isn’t unique to
Snowflake. When using the right processes, monitoring and management,
Snowflake can be a very cost-effective choice for a cloud data platform.
At SELECT, we've built our entire data platform on top of Snowflake due to
its ease of use, scalability, and cost-effectiveness.
