Niall Woodward
Snowflake stores micro-partitions in the object storage service of the cloud provider the account runs on:
AWS - S3
Azure - Azure Blob Storage
GCP - Google Cloud Storage
Ok, enough talking, time for a diagram. Throughout the post, we will use
the example of an orders table. Here's what one of the micro-partitions
stored for the table looks like:
In the header, the byte ranges for each column within the micro-partition
are stored, allowing Snowflake to retrieve only the columns relevant to a
query using a byte range get. This is why queries will run faster if you
reduce the number of columns selected.
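As a quick sketch using the orders table from this example (column names are illustrative):

```sql
-- Downloads only the byte ranges for these two columns,
-- instead of every column in each micro-partition
select
    order_id,
    created_at
from orders;
```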
Let's zoom out to the entire orders table. In this example, the table contains
28 micro-partitions, each with three rows of data (in practice, a micro-
partition typically contains hundreds of thousands of rows).
select *
from orders
In summary
We've learned what micro-partitions are, how Snowflake uses them for
query optimization through pruning, and the various ways to measure
pruning performance.
Frequently used where clause filtering keys are good choices for clustering
keys. For example:
select *
from table_a
where column_a = 'some_value' -- if queries frequently filter on column_a, it's a good clustering key candidate
1. Natural clustering
Suppose there is an ETL process adding new events to an events table each
hour. A column inserted_at represents the time at which events are loaded
into the table. Newly created micro-partitions will each have a tightly
bound range of inserted_at values. This events table would be described
to be naturally clustered on the inserted_at column. A query that filters
this table on the inserted_at column will prune micro-partitions effectively.
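A sketch of a query that benefits from this natural clustering (table and column names follow the example above):

```sql
-- Prunes well: each micro-partition holds a tightly bound range of inserted_at values
select *
from events
where inserted_at >= dateadd('hour', -1, current_timestamp);
```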
Pros
Free - natural clustering is a byproduct of how the data is loaded, requiring no extra effort or cost.
Cons
Only works for queries that filter on a column that correlates to the
order in which data is inserted.
The automatic clustering service is simple to use, but it's easy to spend a lot
of money with it. If you choose to use it, make sure to monitor both the cost
and the impact on queries against the table to determine whether it achieves a
good price/performance ratio. If you're interested in learning more about the
automatic clustering service, check out this detailed post on the inner
workings by one of Snowflake's engineers.
Pros
The lowest-effort way to cluster on a different key than the natural key.
Doesn't block or interfere with DML operations.
Cons
Unpredictable costs.
Snowflake takes a higher margin on automatic clustering than
warehouse compute costs, which can make automatic clustering less
cost-effective than manual re-sorting.
3. Manual sorting
With fully recreated tables
If a table is always fully recreated as part of a transformation/modeling
process, the table can be perfectly clustered on any key by adding an order
by statement to the create table as (CTAS) query:
create or replace table my_table as
with transformations as (
    ...
)
select *
from transformations
order by my_cluster_key
In this scenario of a table that is always fully recreated, we recommend
always using manual sorting over the automatic clustering service as the
table will be well-clustered, and at a much lower cost than the automatic
clustering service.
On existing tables
Manually re-sorting an existing table on a particular key simply replaces the
table with a sorted version of itself. Let’s suppose we have a sales table with
entries for lots of different stores, and most queries on the table always
filter for a specific store. We can perform the following query to ensure that
the table is well-clustered on the store_id:
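A sketch of the re-sorting query (assuming the table is simply named sales):

```sql
-- Replace the table with a version of itself sorted on the cluster key
create or replace table sales as
select *
from sales
order by store_id;
```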
As new sales are added to the table over time, the existing micro-partitions
will remain well-clustered by store_id, but new micro-partitions will
contain records for lots of different stores. That means that older micro-
partitions will prune well, but new micro-partitions won't. Once
performance decreases below acceptable levels, the manual re-sorting
query can be run again to ensure that all the micro-partitions are well-
clustered on store_id.
Pros
Much cheaper than the automatic clustering service, while producing a well-clustered table.
Cons
Higher effort than the automatic clustering service. Requires the user
to either manually execute the sorting query or implement
automated orchestration of the sorting query.
Replacing an existing table with a sorted version of itself reverses any
DML operations which run during the re-sort.
It’s common to see that most queries for an organization filter by the same
columns, such as region or store_id. If queries with common filtering
patterns are causing full table scans, then depending on how the table is
populated, consider using automatic clustering or manual re-sorting to
cluster on the filtered column. If you’re not sure how you’d implement
manual re-sorting or there's a risk of DML operations running during the
re-sort, use the automatic clustering service.
Closing
Ultimately, pruning is achieved with complementary query design and table
clustering. The more data, the more powerful pruning is, with the potential
to improve a query's performance by orders of magnitude.
We’ll go deeper on the topic of clustering in future posts, including the use
of Snowflake’s system$clustering_information function to analyze
clustering statistics. We'll also explore options for when a table needs to be
well-clustered on more than one column, so be sure to subscribe to our
mailing list below. Thanks for reading, and please get in touch
via Twitter or email where we'd be happy to answer questions or discuss
these topics in more detail.
Choosing the right warehouse size
in Snowflake
Date
Sunday, November 27, 2022
Niall Woodward
The days of complex and slow cluster resizing are behind us; Snowflake
makes it possible to spin up a new virtual warehouse or resize an existing
one in a matter of seconds. The implications of this are:
1. Significantly reduced compute idling (auto-suspend and scale-in for
multi-cluster warehouses)
2. Better matching of compute power to workloads (ease of
provisioning, de-provisioning and modifying warehouses)
Before looking at how to choose the right warehouse size, let's first
understand what a virtual warehouse is, and the impact of size on its
available resources and query processing speed.
While the nodes in each warehouse are physically separated, they operate
in harmony, and Snowflake can utilize all the nodes for a single query.
Consequently, we can work on the basis that each warehouse size increase
doubles the available compute cores, RAM, and disk space available.
What virtual warehouse sizes are available?
Snowflake uses t-shirt sizing names for their warehouses, but unlike t-shirts,
each step up indicates a doubling of resources and credit consumption.
Sizes range from X-Small to 6X-Large. Most Snowflake users will only ever
use the smallest warehouse, the X-Small, as it’s powerful enough for most
datasets up to tens of gigabytes, depending on the complexity of the
workloads.
Warehouse Size   Credits per hour
X-Small          1
Small            2
Medium           4
Large            8
X-Large          16
2X-Large         32
3X-Large         64
4X-Large         128
5X-Large         256
6X-Large         512
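Resizing an existing warehouse takes effect in seconds and is a single statement (warehouse name hypothetical):

```sql
alter warehouse transforming set warehouse_size = 'MEDIUM';
```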
The impact of warehouse size on Snowflake query
speeds
1. Processing power
Snowflake uses parallel processing to execute a query across multiple cores
wherever it is faster to do so. More cores means more processing power,
which is why queries often run faster on larger warehouses.
Cost vs performance
CPU-bound queries will double in speed as the warehouse size increases,
up until the point at which they no longer fully utilize the warehouse’s
resources. Ignoring warehouse idle time from auto-suspend thresholds, a
query which runs twice as fast on a Medium warehouse as on a Small one will
cost the same amount to run, as cost = duration × credit usage rate. The
below graph illustrates this behavior, showing that at a certain point, the
execution time for bigger warehouses remains the same while the cost
increases. So, how do we find that sweet spot of maximum performance for
the lowest cost?
A warehouse can run more than one query at a time, so where possible
keep warehouses fully loaded, even with light queueing, for maximum
efficiency. Warehouses for non-user queries such as transformation
pipelines can often be run at greater efficiency due to their tolerance for
queueing.
1. The number of queries by execution time. Here you can see that over
98% of the queries running on this warehouse are taking less than 1
second to execute.
2. The number of queries by utilizable warehouse size. Utilizable
warehouse size represents the size of warehouse a query can fully
utilize. Where lots of queries don't utilize the warehouse's size, it
indicates that the warehouse is oversized or the queries should run
on a smaller warehouse. In this example, over 96% of queries being
run on the warehouse aren’t using all 8 nodes available in the Large
warehouse.
Using partitions scanned as a heuristic
Another helpful heuristic is to look at how many micro-partitions a query
is scanning, and then choose the warehouse size based off that. This
strategy comes from Scott Redding, a resident solutions architect at
Snowflake.
The intuition behind this strategy is that the number of threads available for
processing doubles with each warehouse size increase, and each thread can
process a single micro-partition at a time. You want to ensure that each
thread has plenty of work available (files to process) throughout the query
execution.
To interpret this chart, the goal is to aim for around 250 micro-partitions per
thread. If your query needs to scan 2,000 micro-partitions, then running
the query on an X-Small (8 threads) will give each thread 250 micro-partitions
(files) to process, which is ideal. Compare this with running the query on a
3X-Large warehouse, which has 512 threads. Each of those threads will only get
around 4 micro-partitions to process, which will likely result in many threads
sitting unused.
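If your account has access to the snowflake.account_usage share, a rough version of this heuristic can be sketched against query history (the 250-partitions-per-thread target is the rule of thumb above):

```sql
-- For recent heavy queries, estimate how many threads they could keep busy,
-- assuming ~250 micro-partitions of work per thread
select
    query_id,
    partitions_scanned,
    ceil(partitions_scanned / 250) as utilizable_threads
from snowflake.account_usage.query_history
where start_time > dateadd('day', -7, current_timestamp)
order by partitions_scanned desc
limit 10;
```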
The main pitfall with this approach is that while micro-partitions scanned is
a significant factor in the query execution, other factors like query
complexity, exploding joins, and volume of data sorted will also impact the
required processing power.
Closing
Snowflake makes it easy to match workloads to warehouse configurations,
and we’ve seen queries more than double in speed while costing less
money by choosing the correct warehouse size. Increasing warehouse size
isn't the only option available to make a query run faster though, and many
queries can be made to run more efficiently by identifying and resolving
their bottlenecks. We'll provide a detailed guide on query optimization in a
future post, but if you haven't yet, check out our previous post on
clustering.
Ian Whitestone
Niall Woodward
Snowflake's huge popularity is driven by its ability to process large volumes of data at
extremely low latency with minimal configuration. As a result, it is an established favorite of
data teams across thousands of organizations. In this guide, we share optimization techniques
to maximize the performance and efficiency of Snowflake. Follow these best practices to
make queries run faster while also reducing costs.
All the Snowflake performance tuning techniques discussed in this post are based on the real
world strategies SELECT has helped over 100 Snowflake customers employ. If you think
there's something we've missed, we'd love to hear from you! Reach out via email or use the
chat bubble at the bottom of the screen.
Queries can sometimes spend considerable time reading data from table storage. This step of
the query is shown as a TableScan in the query profile. A TableScan involves downloading
data over the network from the table's storage location into the virtual warehouse's worker
nodes. This process can be sped up by reducing the volume of data downloaded, or increasing
the virtual warehouse's size.
Snowflake only reads the columns which are selected in a query, and the micro-partitions
relevant to a query's filters - provided that the table's micro-partitions are well-
clustered on the filter condition.
There are four techniques to reduce the data downloaded by a query and therefore
speed up TableScans, which we cover in the sections below.
Operations like Joins, Sorts, Aggregates occur downstream of TableScans and can often
become the bottleneck in queries. Strategies to optimize data processing include reducing the
number of query steps, incrementally processing data, and using your knowledge of the data
to improve performance.
We cover several techniques to improve data processing efficiency below.
Snowflake's virtual warehouses can also be easily configured to support larger and
higher concurrency workloads; the key configurations which improve performance
are covered as well.
Before diving into optimizations, let's first remind ourselves how to identify what's slowing a
query down.
To figure this out, use the Snowflake Query Profile (query plan), and look to the 'Most
Expensive Nodes' section. This tells you which parts of the query are taking up the most
query execution time.
In this example, we can see the bottleneck is the Sort step, which would indicate that we
should focus on improving the data processing efficiency, and possibly increase the
warehouse size. If a query's most expensive node(s) are TableScans, efforts will be best
spent optimizing the data read efficiency of the query.
1. Your query needs to include a filter which limits the data required by the query. This
can be an explicit where filter or an implicit join filter.
2. Your table needs to be well clustered on the column used for filtering.
Running the below query against the hypothetical orders table shown in the diagram will
result in query pruning since (a) the orders table is clustered by created_at (the data is
sorted by created_at) and (b) the where clause explicitly filters the created_at with a
specific date.
select *
from orders
where created_at > '2022-08-14' -- illustrative date
To determine whether pruning performance can be improved, take a look at the query
profile's Partitions scanned and Partitions total statistics.
If you're not using where clause filtering in a query, adding a filter may speed up the
TableScan significantly (and downstream nodes too, as they process less data). If your query
already has where clause filtering but 'Partitions scanned' is close to 'Partitions total',
the where clause is not enabling effective pruning.
1. Ensuring where clauses are placed as early in queries as possible, otherwise they may
not be 'pushed down' onto the TableScan step (this also speeds up later steps in the
query)
2. Adding well-clustered columns to join and merge conditions which can be pushed
down as JoinFilters to enable pruning
3. Making sure that columns used in where filters of a query align with the
table's clustering (learn more about clustering here)
4. Avoiding the use of functions in where conditions - these often prevent Snowflake
from pruning micro-partitions
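As a sketch of point 4, compare these two filters (dates illustrative):

```sql
-- Likely prevents pruning: the filter applies a function to the column
select *
from orders
where date_trunc('day', created_at) = '2022-08-14';

-- Enables pruning: the bare column is compared against a constant range
select *
from orders
where created_at >= '2022-08-14'
  and created_at < '2022-08-15';
```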
3. Use clustered columns in join predicates
The most common form of pruning most users will be familiar with is static query pruning.
Here’s a simple example, similar to the one above:
select *
from orders
where order_date > current_date - 7
If the orders table is clustered by order_date, Snowflake’s query optimizer will recognize
that most micro-partitions (files) containing data older than 7 days ago can be ignored. Since
scanning remote data requires significant processing time, eliminating micro-partitions
greatly increases the query speed.
A lesser known feature of Snowflake’s query engine is dynamic pruning. Compared to static
pruning which happens before execution during the query planning phase, dynamic query
pruning happens on the fly as the query is being executed.
Consider a process that regularly updates existing records in the orders table through
a MERGE command. Under the hood, a MERGE requires a join between the source table
containing the new/updated records and the target table (orders) that we want to update.
Dynamic pruning kicks in during the join. How does it work? As the Snowflake query engine
reads the data from the source table, it can identify the range of records present and
automatically push down a filter operation to the target table to avoid unnecessary data
scanning.
Let’s ground this in an example. Imagine we have a source table containing 3 records we
need to update in the target orders table, which is clustered by order date. A typical
MERGE operation would match records between the two tables using a unique key, such as
order key. Because these unique keys are usually random, they won’t force any query
pruning. If we instead modify the MERGE condition to match records on both order key and
order date, then dynamic query pruning can kick in. As Snowflake reads data from the source
table, it can detect the range of dates covered by the 3 orders we are updating. It can then
push down that range of dates into a filter on the target side to prevent having to scan that
entire large table.
How can you apply this to your day to day? If you currently have
any MERGE or JOIN operations where significant time is spent scanning the target table (on the
right), then consider whether you can introduce additional predicates to your join clause that
will force query pruning. Note, this will only work if (a) your target table is clustered by
some key (b) the source table (on the left) you are joining to contains a tightly bound range of
records on the cluster key (i.e. a subset of order dates).
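A sketch of such a MERGE (source table and column names are hypothetical):

```sql
merge into orders as target
using staged_order_updates as source
    on target.order_key = source.order_key
    -- extra predicate on the clustering key enables dynamic pruning
    and target.order_date = source.order_date
when matched then update
    set target.status = source.status;
```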
Using dbt?
When using an incremental materialization strategy in dbt, a MERGE query will be executed
under the hood. To add in an additional join condition to force dynamic pruning, update
the unique_key array to include the extra column (i.e. updated_at).
{{ config(
materialized='incremental',
unique_key=['order_id', 'updated_at'],
) }}
select *
from {{ ref('stg_orders') }}
...
5. Simplify!
Each operation in a query takes time to move data around between worker threads.
Consolidating and removing unnecessary operations reduces the amount of network transfer
required to execute a query. It also helps Snowflake reuse computations and save additional
work. Most of the time, CTEs and subqueries do not impact performance, so use them to help
with readability.
In general, having each query do less makes them easier to debug. Additionally, it reduces the
chance of the Snowflake query optimizer making the wrong decision (i.e. picking the wrong
join order).
Because the QUALIFY filter didn't require information after the join, it could be moved earlier
in the query. This results in significantly less data being joined, vastly improving
performance:
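The pattern looks roughly like this (table and column names are hypothetical):

```sql
with latest_orders as (
    select *
    from orders
    -- filter before the join: keep only the latest record per order
    qualify row_number() over (
        partition by order_key
        order by updated_at desc
    ) = 1
)
select
    o.*,
    c.customer_name
from latest_orders as o
inner join customers as c
    on o.customer_key = c.customer_key;
```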
For transformation queries that write to another table, a powerful way of reducing the volume
of data processed is incrementalization. For the example of the orders table, we could
configure the query to only process new or updated orders, and merge those results into the
existing table.
If a grouped or joined column is heavily skewed (meaning a small number of distinct values
occur most frequently), this can have a detrimental impact on Snowflake's speed. A common
example is grouping by a column that contains a significant number of null values. Filtering
rows with these values out and processing them in a separate operation can result in faster
query speeds.
Finally, range joins can be slow in all data warehouses including Snowflake. Your knowledge
of the interval lengths in the data can be used to reduce the join explosion that occurs. Check
out our recent post on this if you're seeing slow range join performance.
select
    a.*,
    b.*
from model_a as a
inner join model_b as b
    on a.id = b.id
This query was repeatedly taking >45 minutes to run, and failing due to an "Incident".
When digging into the query profile (also referred to as the "query plan"), you can see that
the models being queried were actually complex views, with hundreds of tables.
The solution here is to split up the complex view into simpler, smaller parts, and persist them
as tables.
When the warehouse suspends, Snowflake doesn't guarantee that the cache will persist when
the warehouse is resumed. The impact of cache loss is that queries have to re-scan data from
table storage, rather than reading it from the much faster local cache. If warehouse cache loss
is impacting queries, increasing the auto-suspend threshold will help.
Separately, Snowflake has a global result cache which will return results for identical queries
executed within 24 hours provided that data in the queried tables is the same. There are
certain situations which can prevent the global result cache from being leveraged (i.e. if your
query has a non-deterministic function), so be sure to check that you are hitting the global
result cache when expected. If not, you may need to tweak your query or reach out to support
to file a bug.
There are two main reasons to increase a warehouse's size:
1. Queries are spilling to remote disk (identifiable via the query profile)
2. Query results are needed faster (typically for user-facing applications)
Queries that spill to remote disk run inefficiently due to the large volumes of network traffic
between the warehouse executing the query and the remote disk that stores data used in
executing the query. Increasing the warehouse size doubles both the available RAM and local
disk, which are significantly faster to access than remote disk. Where remote disk spillage
occurs, increasing the warehouse size can more than double a query's speed. We've gone into
more detail on Snowflake warehouse sizing in the past, and covered how to configure
warehouse sizes in dbt too.
Note, if most of the queries running on the warehouse don't require a larger warehouse, and
you want to avoid increasing the warehouse size for all queries, you can instead consider
using Snowflake's Query Acceleration Service. This service, available on Enterprise edition
and above, can be used to give queries which scan a lot of data additional compute resources.
15. Increase the Max Cluster Count
Multi-cluster warehouses, available on Enterprise edition and above, can be used to create
more instances of the same size warehouse.
If there are periods where warehouse queuing causes queries to not meet their required
processing speeds, consider using multi-clustering or increasing the maximum cluster count
in a warehouse. This allows the warehouse to match query volumes by adding or removing
clusters.
Unlike the cluster count, the size of a virtual warehouse cannot be automatically adjusted
by Snowflake in response to query volumes. This makes multi-cluster warehouses more
cost-effective for processing volatile query volumes, as each cluster is only billable while
in an active state.
Other Resources
If you are looking for more content on Snowflake query optimization, we recommend
exploring the additional video resources below.
1. How should you get started with Snowflake cost optimization? (TL;DR: build up a
holistic understanding of your cost drivers before diving into any optimization efforts)
2. Where most customers are today with their understanding of Snowflake usage
3. How does Snowflake's billing model work (did you know it's actually cheaper to store
data in Snowflake?)
4. The tools offered by Snowflake for cost visibility
5. Methods you have to control costs (resource monitors, query timeouts, and
ACCESS CONTROL - the one no one thinks of!)
6. Where should you start with cost cutting? Start optimizing queries? Or go higher
level?
7. Resources for learning more.
For those looking to get an overview of cost optimization, monitoring and control, this is a
great place to start. The video recording can be found below. There is so much to discuss on
this topic and we didn't get to go super deep, so will have to do a follow up soon!
Part 2
In this episode we go deeper into some important foundational concepts around Snowflake
query optimization:
Part 3
In the final episode of the series, we dive into the most important query optimization
techniques:
Slides
The slides can be viewed here. To navigate the slides, you can click the arrows on the bottom
right, or use the arrow keys on your keyboard. Press either the "esc" or the "o" key to zoom
out into an "overview" mode where you can see all slides. From there, you can again navigate
using the arrows and either click a slide or press "esc"/"o" to focus on it.
Presentation Recording
A recording of the presentation is available on YouTube. The presentation starts at 3:29.
If you would like, I am more than happy to come in and give this presentation (or a variation
of it) to your team where they can have the opportunity to ask questions. Send an email
to ian@select.dev if you would like to set that up.
Niall Woodward
Co-founder & CTO of SELECT
Niall is the Co-Founder & CTO of SELECT, a software product which helps users
automatically optimize, understand and monitor Snowflake usage. Prior to starting SELECT,
Niall was a data engineer at Brooklyn Data Company and several startups. As an open-source
enthusiast, he's also a maintainer of SQLFluff, and creator of three dbt packages:
dbt_artifacts, dbt_snowflake_monitoring and dbt_snowflake_query_tags.
Contents
1. Snowflake Query Optimization Techniques
2. How to optimize a Snowflake Query
Ian Whitestone
Niall Woodward
Everything we discuss is based on the real world strategies SELECT has helped over 100
Snowflake customers employ. If you think there's something we've missed, we'd love to hear
from you! Reach out via email or use the chat bubble at the bottom of the screen.
Reducing auto-suspend
Reducing the warehouse size
Ensure minimum clusters are set to 1
Consolidate warehouses
2. Workload configuration
3. Table configuration
Because of the minimum 1 minute billing period, it's possible for users to get double charged
if the auto-suspend is set to 30s. Here's an example: a query resumes the warehouse and
finishes in 10 seconds, the warehouse suspends 30 seconds later, and a second query arrives
shortly afterwards, resuming the warehouse again.
Despite only being up for ~1 minute, the user will actually be charged for 2 minutes of
compute in this scenario since the warehouse has resumed twice and you are charged for a
minimum of 1 minute each time it resumes.
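Auto-suspend is configured per warehouse (name hypothetical); 60 seconds avoids this double-charging pattern:

```sql
alter warehouse transforming set auto_suspend = 60;  -- seconds
```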
Here’s the monthly pricing for each warehouse assuming it runs continuously (though not
typically how warehouses run due to auto-suspend, it gives a better sense of cost than
hourly), priced at $2.50 per credit:

Warehouse Size   Credits per month   Cost per month
X-Small          720                 $1,800
Small            1,440               $3,600
Medium           2,880               $7,200
Large            5,760               $14,400
X-Large          11,520              $28,800
2X-Large         23,040              $57,600
3X-Large         46,080              $115,200
4X-Large         92,160              $230,400
5X-Large         184,320             $460,800
6X-Large         368,640             $921,600
Over-sized warehouses can sometimes make up the majority of Snowflake usage. Reduce
warehouse sizes and observe the impact on workloads. If performance is still acceptable, try
reducing size again. Check out our full guide to choosing the right warehouse size in
Snowflake, which includes practical heuristics you can use to identify oversized warehouses.
As a quick practical example, consider a data loading job that loads ten files every hour on a
Small size warehouse. A small size warehouse has 2 nodes and a total of 16 cores available
for processing. This job can at most saturate 10 out of the 16 cores (1 file per core), meaning
this warehouse will not be fully utilized. It would be significantly more cost effective to run
this job on an X-Small warehouse.
4. Consolidate warehouses
A big problem we see with many Snowflake customers is warehouse sprawl. When there
are too many warehouses, many of them will not be fully saturated with queries and they will
sit idle, resulting in unnecessary credit consumption.
Here's an example of the warehouses in our own Snowflake account, visualized in the
SELECT product. We calculate & surface a custom metric called warehouse utilization
efficiency, which looks at the % of time the warehouse is active and processing queries.
Looking at the SELECT_BACKEND_LARGE warehouse in the second row, this warehouse has a
low utilization efficiency of 11%, meaning that 89% of the time we are paying for it, it is
sitting idle and not processing any queries. There are several other warehouses with low
efficiency as well.
The best way to ensure virtual warehouses are being utilized efficiently is to use as few as
possible. Where needed, create separate warehouses based on performance requirements
rather than domains of workload.
For example, creating one warehouse for all data loading, one for transformations, and one
for live BI querying will lead to better cost efficiency than one warehouse for marketing data
and one for finance data. All data-loading workloads typically have the same performance
requirements (tolerate some queueing) and can often share a multi-cluster X-Small
warehouse. In contrast, all live, user-facing queries may benefit from a larger warehouse to
reduce latency.
Where workloads within each category (loading, transformation, live querying, etc.) need a
larger warehouse size for acceptable query speeds, create a new larger warehouse just for
those. For best cost efficiency, queries should always run on the smallest warehouse they
perform sufficiently quickly on.
5. Reduce query frequency
At many organizations, batch data transformation jobs often run hourly by default. But do
downstream use cases need such low latency? Here are some examples of how reducing run
frequency can have an immediate impact on cost. In this example, we assume all workloads
are non-incremental and so perform a full data refresh each run, and that the initial cost of
running hourly was $100,000.
Run frequency                                        Annual cost
Hourly                                               $100,000
Once at the start of the working day, once midday    $8,000
Daily                                                $4,000
Rather than reprocessing all data in every batch data transformation job, incrementalization
can be used to filter for only records which are new or updated within a certain time window,
perform the transformations, and then insert or update that data into the final table.
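As a sketch (table and column names hypothetical), an incremental load might look like:

```sql
insert into orders_summary
select
    order_key,
    order_date,
    total_price
from orders
-- only process records loaded since the last run
where loaded_at > (select max(loaded_at) from orders_summary);
```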
Here's an example of the cost reduction we achieved by converting one of our data models to
only process new data:
7. Ensure tables are clustered correctly
One of the most important query optimization techniques is query pruning: a technique to
reduce the number of micro-partitions scanned when executing a query. Reading micro-
partitions is one of the most expensive steps in a query, since it involves reading data
remotely over the network. If a filter is applied in a where clause, join, or subquery,
Snowflake will attempt to eliminate any micro-partitions it knows don’t contain relevant data.
For this to work, the micro-partitions have to contain a narrow range of values for the column
you're filtering on.
select *
from orders
where created_at > '2022-08-14' -- illustrative filter; prunes only if the table is clustered on created_at
select
    table_catalog as database_name,
    table_schema as schema_name,
    table_name
from snowflake.account_usage.table_storage_metrics
limit 10
If your tables are regularly deleted and re-created as part of some ETL process, or if you have
a separate copy of the data available in cloud storage, then in most cases there is no need to
back up their data. By changing a table from permanent to transient, you can avoid
spending unnecessarily on Fail-safe and Time Travel backups.
create or replace transient table raw.orders copy grants as
select *
from raw.orders
...
If you instead split this file into ten files that are 100 MB each, you will utilize 10 threads out
of 16. This level of parallelization is much better, as it leads to better utilization of the given
compute resources (although it's worth noting that an X-Small would still be the better choice
in this scenario).
Having too many small files can also lead to excessive costs if you are using Snowpipe for data
loading, since Snowflake charges an overhead fee of 0.06 credits per 1,000 files loaded.
You can also use access control to limit which users can run queries on certain warehouses.
By only allowing users to use smaller warehouses, you will force them to write more efficient
queries rather than defaulting to running on a larger warehouse size. When required, there can
be policies or processes in place to allow certain queries/users to run on a larger warehouse
when absolutely necessary.
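Warehouse access is managed with standard grants (warehouse and role names hypothetical):

```sql
-- Analysts can only run queries on the X-Small warehouse
grant usage on warehouse analytics_xs to role analyst;
```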
Query timeouts are a great way to mitigate the impact of runaway queries. By default, a
Snowflake query can run for two days before it is cancelled, racking up significant costs. We
recommend you put query timeouts in place on all warehouses to mitigate the maximum cost
a single query can incur. See our post on the topic for more advice on how to set these.
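Timeouts can be set at the warehouse level (name and threshold illustrative):

```sql
alter warehouse analytics_xs set statement_timeout_in_seconds = 3600;  -- 1 hour
```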
In this post, we've shared a bunch of ways you can lower your Snowflake costs to meet your
cost reduction targets, or free up your budget for new workloads. But getting to a place
where you need these techniques means you have already incurred these costs and
potentially operated in a suboptimal way for an extended period.
One of the best ways to prevent unnecessary costs is to implement an effective cost
monitoring strategy from the start. Build your own dashboard on top of the Snowflake
account usage views and review it each week, or try out a purpose built cost monitoring
product like SELECT. By catching spend issues early, you can prevent unnecessary costs
from piling up in the first place.
Recording
If you would like, we are more than happy to come in and give this presentation (or a
variation of it) to your team where they can have the opportunity to ask questions. Send an
email to ian@select.dev if you would like to set that up.
Slides
Ian Whitestone
Co-founder & CEO of SELECT
Ian is the Co-founder & CEO of SELECT, a software product which helps users
automatically optimize, understand and monitor Snowflake usage. Prior to starting SELECT,
Ian spent 6 years leading full stack data science & engineering teams at Shopify and Capital
One. At Shopify, Ian led the efforts to optimize their data warehouse and increase cost
observability.
Niall Woodward
Co-founder & CTO of SELECT
Niall is the Co-Founder & CTO of SELECT, a software product which helps users
automatically optimize, understand and monitor Snowflake usage. Prior to starting SELECT,
Niall was a data engineer at Brooklyn Data Company and several startups. As an open-source
enthusiast, he's also a maintainer of SQLFluff, and creator of three dbt packages:
dbt_artifacts, dbt_snowflake_monitoring and dbt_snowflake_query_tags.
Contents
Before you start
Cost Optimization Techniques
1. Reduce auto-suspend to 60 seconds
2. Reduce virtual warehouse size
3. Ensure minimum clusters are set to 1
4. Consolidate warehouses
5. Reduce query frequency
6. Only process new or updated data
7. Ensure tables are clustered correctly
8. Drop unused tables
9. Lower data retention
10. Use transient tables
11. Avoid frequent DML operations
12. Ensure files are optimally sized
13. Leverage access control
14. Enable query timeouts
15. Configure resource monitors
A final word of advice
The Missing Manual: Everything You Need to Know about Snowflake Cost Optimization (April 2023)
Optimize your Snowflake usage
SELECT automatically optimizes and helps you manage your Snowflake usage with ease.
Ian Whitestone
Snowflake has skyrocketed in popularity over the past 5 years and firmly
planted itself at the center of many companies' data stacks. Snowflake
came into existence in 2012 with a unique architecture, described in
their seminal white paper as "the elastic data warehouse". Rather than
have compute and storage coupled on the same machine like their
competitors did [1], they proposed a new design that decoupled compute
from cheap, scalable cloud storage.
Cloud Services
The cloud services layer is the entry point for all interactions a user will have
with Snowflake. It consists of stateless services, backed by
a FoundationDB database storing all required metadata. Authentication
and access control (who can access Snowflake and what can they do within
it) are examples of services in this layer. Query compilation and
optimization are other critical roles handled by cloud services. During
compilation, Snowflake applies optimizations like reducing the number of
micro-partitions that a given user's query needs to scan (compile-time pruning).
Cloud services are also responsible for infrastructure and transaction
management. When new virtual warehouses need to be provisioned to
serve a query, cloud services will ensure they become available. If a query is
attempting to access data that is being updated by another transaction, the
cloud services layer waits for the update to complete before results are
returned.
Serving previously run queries from the global result cache also helps
reduce the load on the compute layer [4], which we'll discuss next.
Compute
After a given query has passed through cloud services, it is sent to the
compute layer for execution. The compute layer is composed of all virtual
warehouses a customer has created. Virtual warehouses are an abstraction
over one or more compute instances, or "nodes". For Snowflake accounts
running on Amazon Web Services, a node would be equivalent to a single
EC2 instance. Snowflake uses t-shirt sizing for its warehouses to configure
how many nodes they will have. Customers will typically create separate
warehouses for different workloads. In the image below, we can see a
hypothetical setup with 3 virtual warehouses: a small warehouse used for
business intelligence [5], an extra-small warehouse used for loading data,
and so on. All warehouse sizes run on the same underlying node type, with
the exception of 5XL and 6XL which run on different node specifications.
With each warehouse size increase, the number of nodes in the warehouse
will double. This means that the number of threads, memory, and disk
space will also double. A size small warehouse will have twice as much
memory (32GB), twice as many cores (16), and double the amount of disk
space of an extra-small warehouse [6]. By extension, a large
warehouse will have 8 times the resources of an extra-small warehouse.
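The doubling pattern means a warehouse's resources can be derived from its size index. A sketch (the X-Small baseline of 1 node, 16GB memory and 8 cores follows from the Small figures above, and per the notes these numbers are unofficial):

```python
SIZES = ["X-Small", "Small", "Medium", "Large", "X-Large", "2X-Large", "3X-Large", "4X-Large"]

def warehouse_resources(size, xs_nodes=1, xs_memory_gb=16, xs_cores=8):
    """Each size step doubles the node count, and total memory/cores with it."""
    factor = 2 ** SIZES.index(size)
    return {"nodes": xs_nodes * factor,
            "memory_gb": xs_memory_gb * factor,
            "cores": xs_cores * factor}

# A Small has double everything an X-Small has
assert warehouse_resources("Small") == {"nodes": 2, "memory_gb": 32, "cores": 16}
# A Large has 8x the resources of an X-Small
assert warehouse_resources("Large")["nodes"] == 8
```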
Storage
Snowflake stores your tables in a scalable cloud storage service (S3 if you
are on AWS, Azure Blob Storage for Azure, etc.) [7]. Every table is partitioned into a
number of immutable micro-partitions. Micro-partitions use
a proprietary, closed-source file format created by Snowflake. Snowflake
aims to keep them around 16MB, heavily compressed [8]. As a result, there
can be thousands of micro-partitions backing a single large table.
Summary
Snowflake's unique, scalable architecture has allowed it to quickly become
the dominant data warehouse of today. In future posts, we'll dive deeper
into each individual layer in Snowflake's architecture and discuss how you
can take advantage of their features to maximize query
performance and lower costs.
Notes
1
At the time, Snowflake's main competitors were Amazon Redshift and
traditional on-premise offerings like Oracle and Teradata. These existing
solutions all coupled storage and compute on the same machines, making
them difficult and expensive to scale. Today, Snowflake's bigger
competitors are the likes of BigQuery and Databricks. BigQuery likely has a
similar market share, if not greater, due to their seamless integration with
the rest of their Google Cloud Platform. Databricks has become a new
competitor as both companies are beginning to re-position themselves as
"data clouds". ↩
2
With their move to become a full-on "data cloud", Snowflake is rapidly
adding new functionality like Snowpark, Unistore, External Tables, Streamlit
and a native App store - all of which extend Snowflake's architecture. We'll
be ignoring these new capabilities in this architecture review, and focusing
on the data warehousing aspects that most customers use as of today. ↩
3
The queries must be identical in order to be served from the global result
cache. In addition, Snowflake actually has two different caches which can
benefit performance: a global result cache and a local cache in each
warehouse. We'll cover both in more detail in a future post. ↩
4
In addition to serving previously run queries from the global result cache,
Snowflake can also process certain queries
like count(*) or max(column) entirely by leveraging the metadata storage.
Learn more in our post about micro-partitions ↩
5
This warehouse is actually a multi-cluster warehouse, which means
Snowflake will allocate additional compute resources if the query demand
surpasses what a single small warehouse can handle. We'll cover multi-
cluster warehouses in more depth in a future post. ↩
6
These figures are for AWS, and will differ slightly for other cloud providers.
They are not guaranteed to be accurate since Snowflake does not publish
them, and Snowflake can change the underlying servers and warehouse
configurations at any point. The figures provided here were last validated in
August 2022 through two separate sources, and appear consistent with what
was observed in 2019. I have not been able to validate the disk space available
on each node, but plan to figure this out experimentally in the coming
months. ↩
7
The exception to this is if you are using external tables to store your
data. ↩
8
This compression is done automatically by Snowflake under the hood.
Uncompressed, these files can be over 500MB! ↩
Calculating cost per query in Snowflake
Date
Wednesday, October 12, 2022
Ian Whitestone
For most Snowflake customers, compute costs (the charges for virtual
warehouses), will make up the largest portion of the bill. To effectively
reduce this spend, high cost-driving queries need to be accurately
identified.
Snowflake customers are billed for each second that virtual warehouses
are running [1], with a minimum 60 second charge each time one is resumed.
The Snowflake UI currently provides a breakdown of cost per virtual
warehouse, but doesn't attribute spend at a more granular, per-query level.
This post provides a detailed overview and comparison of different ways to
attribute warehouse costs to queries, along with the code required to do
so.
Skip to the final SQL?
If you want to skip ahead and see the SQL implementation for the
recommended approach, you can head straight to the end!
Simple approach
We'll start with a simple approach which multiplies a query's execution time
by the billing rate for the warehouse it ran on. For example, say a query
ran for 10 minutes on a medium size warehouse. A medium warehouse
costs 4 credits per hour, and with a cost of $3 per credit [2], we'd say this
query cost $2.
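This simple model is easy to express in code. A minimal sketch (credit rates per the footnotes; the $3/credit figure is the example rate used above):

```python
WAREHOUSE_CREDITS_PER_HOUR = {"X-Small": 1, "Small": 2, "Medium": 4, "Large": 8, "X-Large": 16}

def simple_query_cost(execution_ms, warehouse_size, dollars_per_credit=3.0):
    """Estimated cost = execution hours x warehouse credit rate x $/credit."""
    hours = execution_ms / (1000 * 60 * 60)
    return hours * WAREHOUSE_CREDITS_PER_HOUR[warehouse_size] * dollars_per_credit

# 10 minutes on a Medium at $3/credit comes out to roughly $2
assert abs(simple_query_cost(10 * 60 * 1000, "Medium") - 2.0) < 1e-9
```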
SQL Implementation
We can implement this in SQL by leveraging
the snowflake.account_usage.query_history view which contains all queries from
the last year along with key metadata like the total execution time and size
of the warehouse the query ran on:
WITH
warehouse_sizes AS (
    -- map each warehouse size to its hourly credit rate
    SELECT 'X-Small' AS warehouse_size, 1 AS credits_per_hour UNION ALL
    SELECT 'Small', 2 UNION ALL
    SELECT 'Medium', 4 UNION ALL
    SELECT 'Large', 8 UNION ALL
    SELECT 'X-Large', 16 UNION ALL
    SELECT '2X-Large', 32
)
SELECT
    qh.query_id,
    qh.query_text,
    qh.execution_time/(1000*60*60)*wh.credits_per_hour AS query_cost
FROM snowflake.account_usage.query_history AS qh
INNER JOIN warehouse_sizes AS wh
    ON qh.warehouse_size=wh.warehouse_size
WHERE
    qh.start_time >= DATEADD('day', -30, CURRENT_DATE)
This gives us an estimated query cost for each query_id. To account for the
same query being run multiple times in a period, we can aggregate by
the query_text:
WITH
warehouse_sizes AS (
// same as above
),
queries AS (
SELECT
qh.query_id,
qh.query_text,
qh.execution_time/(1000*60*60)*wh.credits_per_hour AS query_cost
FROM snowflake.account_usage.query_history AS qh
INNER JOIN warehouse_sizes AS wh
ON qh.warehouse_size=wh.warehouse_size
WHERE
    qh.start_time >= DATEADD('day', -30, CURRENT_DATE)
)
SELECT
query_text,
SUM(query_cost) AS total_query_cost_last_30d
FROM queries
GROUP BY 1
SELECT
id,
created_at
FROM orders
SELECT
id,
created_at
FROM orders
Similar to Looker, dbt will add its own metadata, giving each query a
unique invocation_id:
SELECT
id,
created_at
FROM orders
/*{
"app": "dbt",
"invocation_id": "52c47806ae6d",
"node_id": "model.jaffle_shop.orders",
...
}*/
When grouped by query_text, the two occurrences of the query above won't
be linked, since this metadata makes each one unique. This could result in a
single, easily addressed source of high cost queries (for
example, a dashboard) going unidentified.
We may wish to go even further and bucket costs at a higher level. dbt
models often consist of multiple queries being run: a CREATE TEMPORARY
TABLE followed by a MERGE statement. A given dashboard may trigger 5
different queries each time it is refreshed. Being able to group the entire
collection of queries from a single origin is very useful for attributing spend
and then targeting improvements in a time efficient manner.
New approach
To be able to reconcile the total attributed query costs with the final bill, it's
important to start with the exact charges for each warehouse. The decision
to use an hourly granularity comes
from snowflake.account_usage.warehouse_metering_history , the source of truth for
warehouse charges, which reports credit consumption at an hourly level.
We can then calculate how many seconds each query spent executing in
the hour, and allocate the credits proportionally to each query based on
their fraction of the total execution time. In doing so, we will account for
idle time by distributing it among the queries that ran during the period.
Concurrency will also be handled since more queries running will generally
lower the average cost per query.
When only one query runs in an hour, like Query 5 below, all credit
consumption is attributed to that one query, including the credits
consumed by the warehouse sitting idle.
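The allocation logic itself is straightforward. A minimal sketch of the proportional attribution described above (query IDs and timings are hypothetical):

```python
def allocate_hourly_credits(credits_used, query_ms_by_id):
    """Split one hour's warehouse credits across the queries that ran,
    in proportion to each query's execution milliseconds within the hour.
    Idle time is implicitly spread across whichever queries ran."""
    total_ms = sum(query_ms_by_id.values())
    return {qid: credits_used * ms / total_ms
            for qid, ms in query_ms_by_id.items()}

# Query 5 ran alone in its hour, so it absorbs all 4 credits, idle time included
assert allocate_hourly_credits(4.0, {"query_5": 90_000}) == {"query_5": 4.0}
# Two queries with equal run time split the hour's credits evenly
assert allocate_hourly_credits(4.0, {"a": 1_800_000, "b": 1_800_000}) == {"a": 2.0, "b": 2.0}
```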
SQL Implementation
Some queries don't execute on a warehouse and are processed entirely by
the cloud services layer [3]. To filter those, we remove queries
with warehouse_size IS NULL. We'll also calculate a new
execution_start_time which excludes time spent compiling and queued [4]:
SELECT
query_id,
query_text,
warehouse_id,
TIMEADD(
'millisecond',
queued_overload_time + compilation_time +
queued_provisioning_time + queued_repair_time +
list_external_files_time,
start_time
) AS execution_start_time,
end_time
FROM snowflake.account_usage.query_history AS q
WHERE TRUE
We need to generate a table with one row per hour that the query ran
within.
WITH
filtered_queries AS (
SELECT
query_id,
query_text,
warehouse_id,
TIMEADD(
'millisecond',
queued_overload_time + compilation_time +
queued_provisioning_time + queued_repair_time +
list_external_files_time,
start_time
) AS execution_start_time,
end_time
FROM snowflake.account_usage.query_history AS q
WHERE TRUE
),
hours_list AS (
SELECT
DATEADD(
'hour',
) as hour_start,
),
query_hours AS (
SELECT
hl.hour_start,
hl.hour_end,
queries.*
FROM hours_list AS hl
Now we can calculate the number of milliseconds each query ran for within
each hour along with their fraction relative to all queries.
query_seconds_per_hour AS (
SELECT
*,
num_milliseconds_query_ran/total_query_milliseconds_in_hour AS
fraction_of_total_query_time_in_hour,
hour_start AS hour
FROM query_hours
),
credits_billed_per_hour AS (
SELECT
start_time AS hour,
warehouse_id,
credits_used_compute
FROM snowflake.account_usage.warehouse_metering_history
),
query_cost AS (
SELECT
query.*,
credits.credits_used_compute*2.28 AS actual_warehouse_cost,
credits.credits_used_compute*fraction_of_total_query_time_in_hour*2.28 AS
query_allocated_cost_in_hour
FROM query_seconds_per_hour AS query
INNER JOIN credits_billed_per_hour AS credits
ON query.warehouse_id=credits.warehouse_id
AND query.hour=credits.hour
)
SELECT
query_id,
ANY_VALUE(MD5(query_text)) AS query_signature,
ANY_VALUE(query_text) AS query_text,
SUM(query_allocated_cost_in_hour) AS query_cost,
ANY_VALUE(warehouse_id) AS warehouse_id
FROM query_cost
GROUP BY 1
SELECT
id, -- This is a comment
total_price, -- So is this
FROM orders
/*
Woo!
*/
SELECT
query_text AS original_query_text,
-- First, we remove comments enclosed by /* <comment text> */
FROM snowflake.account_usage.query_history AS q
This approach also does not take into account the minimum 60-second
billing charge. If two queries run separately in a given hour, and
one takes 1 second to execute while the other takes 60 seconds, the second
query will appear 60 times more expensive than the first, even
though the first query also consumes 60 seconds' worth of credits due to
the minimum charge.
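The minimum charge can be sketched as follows (assuming each query triggers its own warehouse resume):

```python
def billed_seconds(run_seconds, minimum=60):
    """Each warehouse resume is billed for at least 60 seconds."""
    return max(run_seconds, minimum)

# A 1-second query and a 60-second query, each on their own resume,
# are billed identically despite the 60x difference in execution time
assert billed_seconds(1) == billed_seconds(60) == 60
```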
The query_text processing technique has room for improvement too. It's not
uncommon for incremental data models to have hardcoded dates
generated into the SQL, which change on each run. For example:
SELECT
...
FROM orders
WHERE
You can also see this behaviour in parameterized dashboard queries. For
example, a marketing dashboard may expose a templated query:
SELECT
id,
FROM customers
WHERE
country_code = {{ selected_country_code }}
Each time this same query is run, it is populated with different values:
SELECT
id,
FROM customers
WHERE
country_code = 'CA'
While the parameterized queries can be handled with more advanced SQL
text processing, idle and minimum billing times are trickier. At the end of
the day, the purpose of attributing warehouse costs to queries is to help
users determine where they should focus their time. We strongly believe the
current approach lets you accomplish this goal. All models are wrong, but
some are useful.
Notes
1
Snowflake uses the concept of credits for most of its billable services.
When warehouses are running, they consume credits. The rate at which
credits are consumed doubles each time the warehouse size is increased.
An X-Small warehouse costs 1 credit per hour, a small costs 2 credits per
hour, a medium costs 4 credits per hour, etc. Each Snowflake customer will
pay a fixed rate per credit, which is how the final dollar value on the
monthly bill is calculated. ↩
2
The cost per credit will vary based on the plan you are on (Standard,
Enterprise, Business Critical, etc..) and your contract. On demand customers
will generally pay $2/credit for Standard, and $3/credit on Enterprise. If you
sign an annual contract with Snowflake, this rate will get discounted based
on how many credits you purchase up front. All examples here are in US
dollars. ↩
3
It is possible for queries to run without a warehouse by leveraging the
metadata in cloud services. ↩
4
There are a number of things that need to happen before a query can
begin executing in a warehouse, such as query compilation in cloud
services and warehouse provisioning. In a future post we'll dive deep into
the lifecycle of a Snowflake query. ↩
5
Due to greedy matching, the REGEX '(/\*.*\*/)' won't work when two comments
appear on the same line, such as /* hi */SELECT * FROM table/* hello there */ ↩
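The fix is a non-greedy quantifier. A quick Python illustration of the failure mode and the `.*?` alternative:

```python
import re

sql = "/* hi */SELECT * FROM table/* hello there */"

# Greedy: .* spans from the first /* to the last */, swallowing the SQL too
assert re.sub(r"/\*.*\*/", "", sql) == ""

# Non-greedy: .*? stops at the first */, leaving the statement intact
assert re.sub(r"/\*.*?\*/", "", sql) == "SELECT * FROM table"
```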
Appendix
Complete SQL Query
For a Snowflake account with ~9 million queries per month, the query
below took 93 seconds on an X-Small warehouse.
WITH
filtered_queries AS (
SELECT
query_id,
query_text AS original_query_text,
warehouse_id,
TIMEADD(
'millisecond',
queued_overload_time + compilation_time +
queued_provisioning_time + queued_repair_time +
list_external_files_time,
start_time
) AS execution_start_time,
end_time
FROM snowflake.account_usage.query_history AS q
WHERE TRUE
),
-- 1 row per hour from 30 days ago until the end of today
hours_list AS (
SELECT
DATEADD(
'hour',
) as hour_start,
),
query_hours AS (
SELECT
hl.hour_start,
hl.hour_end,
queries.*
FROM hours_list AS hl
),
query_seconds_per_hour AS (
SELECT
*,
num_milliseconds_query_ran/total_query_milliseconds_in_hour AS
fraction_of_total_query_time_in_hour,
hour_start AS hour
FROM query_hours
),
credits_billed_per_hour AS (
SELECT
start_time AS hour,
warehouse_id,
credits_used_compute
FROM snowflake.account_usage.warehouse_metering_history
),
query_cost AS (
SELECT
query.*,
credits.credits_used_compute*2.28 AS actual_warehouse_cost,
credits.credits_used_compute*fraction_of_total_query_time_in_hour*2.28 AS
query_allocated_cost_in_hour
FROM query_seconds_per_hour AS query
INNER JOIN credits_billed_per_hour AS credits
ON query.warehouse_id=credits.warehouse_id
AND query.hour=credits.hour
),
cost_per_query AS (
SELECT
query_id,
ANY_VALUE(MD5(cleaned_query_text)) AS query_signature,
SUM(query_allocated_cost_in_hour) AS query_cost,
ANY_VALUE(original_query_text) AS original_query_text,
ANY_VALUE(warehouse_id) AS warehouse_id
FROM query_cost
GROUP BY 1
)
SELECT
query_signature,
COUNT(*) AS num_executions,
AVG(query_cost) AS avg_cost_per_execution,
SUM(query_cost) AS total_cost_last_30d,
ANY_VALUE(original_query_text) AS sample_query_text
FROM cost_per_query
GROUP BY 1
Niall Woodward
Introduction
I had the pleasure of attending dbt’s Coalesce conference in London last
week, and dropped into a really great talk by Felipe Leite and Stephen
Pastan of Miro. They mentioned how they’d achieved a considerable speed
improvement by switching database clones out for multiple table clones. I
had to check it out.
Experiments
Results were collected using the following query:
select
count(*) as query_count,
sum(credits_used_cloud_services) as credits_used_cloud_services
Setup
Create a database with 10 schemas, 100 tables in each:
import snowflake.connector
con = snowflake.connector.connect(
...
Results:
Query count | Duration | Cloud services credits
import snowflake.connector
cursor = con.cursor(DictCursor)
for i in cursor.fetchall():
con = snowflake.connector.connect(
...
session_parameters={
clone_database_by_schema("test", "test_2")
Results:
Query count: 12 | Duration: 1m 47s | Cloud services credits: 0.148
Using execute_async executes each SQL statement without waiting for it
to complete, resulting in all 10 schemas being cloned concurrently. A
whopping 10x faster from start to finish compared with the regular
database clone.
import snowflake.connector
cursor = con.cursor(DictCursor)
con = snowflake.connector.connect(
...
session_parameters={
},
clone_database_by_table("test", "test_3")
This took 1 minute 48s to complete, the limiting factor being the rate at
which the queries could be dispatched by the client (likely due to network
waiting times). To help mitigate that, I distributed the commands across 10
threads:
import snowflake.connector
from snowflake.connector import DictCursor
import threading
class ThreadedRunCommands:
    """Helper for running SQL commands across a number of threads."""

    def __init__(self, con, threads):
        self.threads = threads
        self.register_command_thread = 0
        self.thread_commands = [
            [] for _ in range(self.threads)
        ]
        self.con = con

    def register_command(self, command):
        # Assign commands to threads round-robin
        self.thread_commands[self.register_command_thread].append(command)
        if self.register_command_thread + 1 == self.threads:
            self.register_command_thread = 0
        else:
            self.register_command_thread += 1

    def run_command(self, command):
        self.con.cursor().execute(command)

    def run_commands(self, commands):
        for command in commands:
            self.run_command(command)

    def run(self):
        procs = []
        for v in self.thread_commands:
            proc = threading.Thread(target=self.run_commands, args=(v,))
            procs.append(proc)
            proc.start()
        for proc in procs:
            proc.join()
cursor = con.cursor(DictCursor)
results = cursor.fetchall()
schemas_to_create = {r['schema_name'] for r in results}
threaded_run_commands.run()
con = snowflake.connector.connect(
...
session_parameters={
Results:
Query count | Duration | Cloud services credits
Using 10 threads, the time between the create database command starting
and the final create table ... clone command completing was only 22
seconds. This is 60x faster than the create database ... clone command.
The bottleneck is still the rate at which queries can be dispatched.
In Summary
The complete results:
All the queries run were processed entirely by cloud services, and did not
require a running warehouse or resume a suspended one.
I hope that Snowflake improves their schema and database clone
functionality, but in the meantime, cloning tables seems to be the way to
go.
Thanks again to Felipe Leite and Stephen Pastan of Miro for sharing this!
3 Ways to Achieve Effective Clustering
in Snowflake
Date
Saturday, November 12, 2022
Niall Woodward
In our previous post on micro-partitions, we dove into how Snowflake's unique storage
format enables a query optimization called pruning. Pairing query design with effective
clustering can dramatically improve pruning and therefore query speeds. We'll explore how
and when you should leverage this powerful Snowflake feature.
Snowflake maintains minimum and maximum value metadata for each column in each micro-
partition. In this table, each micro-partition contains records for a narrow range
of created_at values, so the table is well-clustered on the column. The following query only
scans the first three micro-partitions highlighted, as Snowflake knows it can ignore the rest
based on the where clause and minimum and maximum value micro-partition metadata. This
behavior is called query pruning.
select *
from orders
Unsurprisingly, the impact of scanning only three micro-partitions instead of every micro-
partition is that the query runs considerably faster.
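The pruning decision can be sketched as a range-overlap check on per-partition metadata (the partition ranges below are hypothetical; Snowflake's actual metadata store is far richer):

```python
# min/max metadata Snowflake keeps per micro-partition (hypothetical values)
partitions = [
    {"id": 1, "min_created_at": "2022-01-01", "max_created_at": "2022-01-03"},
    {"id": 2, "min_created_at": "2022-01-03", "max_created_at": "2022-01-05"},
    {"id": 3, "min_created_at": "2022-01-05", "max_created_at": "2022-01-08"},
]

def partitions_to_scan(parts, filter_min, filter_max):
    """Keep only micro-partitions whose [min, max] range overlaps the filter;
    the rest are pruned without ever being read."""
    return [p["id"] for p in parts
            if p["max_created_at"] >= filter_min and p["min_created_at"] <= filter_max]

# A narrow date filter prunes partitions 2 and 3 entirely
assert partitions_to_scan(partitions, "2022-01-01", "2022-01-02") == [1]
```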
Pruning is arguably the most powerful optimization technique available to Snowflake users,
as reducing the amount of data scanned and processed is such a fundamental principle in big
data processing: “The fastest way to process data? Don’t.”
Snowflake’s documentation suggests that clustering is only beneficial for tables containing
“multiple terabytes (TB) of data”. In our experience, however, clustering can have
performance benefits for tables starting at hundreds of megabytes (MB).
Frequently used where clause filtering keys are good choices for clustering keys. For
example:
select *
from table_a
The above query will benefit from a table that is well-clustered on the created_at column,
as similar values would be contained within the same micro-partition, resulting in only a
small number of micro-partitions being scanned. This pruning determination is performed by
the query compiler in the cloud services layer, prior to execution taking place.
1. Natural clustering
Suppose there is an ETL process adding new events to an events table each hour. A
column inserted_at represents the time at which events are loaded into the table. Newly
created micro-partitions will each have a tightly bound range of inserted_at values. This
events table would be described to be naturally clustered on the inserted_at column. A
query that filters this table on the inserted_at column will prune micro-partitions
effectively.
When performing a backfill of a table that you'd like to leverage natural, insertion-order
clustering on, make sure to sort the data by the natural clustering key first. That way the
historic records are well-clustered, as well as the new ones that get inserted.
Pros
No additional expenditure or effort required
Cons
Only works for queries that filter on a column that correlates to the order in which
data is inserted
The automatic clustering service uses Snowflake-managed compute resources to perform the
re-clustering operation. This service only runs if a clustering key has been set for the table.
The automatic clustering service performs work in the background to create and destroy
micro-partitions so they contain tightly bound ranges of records based on the specified
clustering key. This service is charged based on how much work Snowflake performs, which
depends on the clustering key, the size of the table and how frequently its contents are
modified. Consequently, tables that are frequently modified (inserts, updates, deletes) will
incur higher automatic clustering costs. It's worth noting that the automatic clustering service
only uses the first 5 bytes of a column when performing re-clustering. This means that
column values with the same first few characters won't cause the service to perform any re-
clustering.
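This prefix behavior is easy to picture (a hypothetical sketch; the 5-byte figure is the one quoted above, and the real service's internals are more involved):

```python
def clustering_prefix(value, prefix_bytes=5):
    """The service reportedly compares only a short prefix of each value
    when deciding whether re-clustering work is needed."""
    return str(value)[:prefix_bytes]

# Both URLs share the prefix "https", so they look identical to the
# clustering service and won't trigger any re-clustering between them
assert clustering_prefix("https://select.dev") == clustering_prefix("https://snowflake.com")
```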
The automatic clustering service is simple to use, but easy to spend money with. If you
choose to use it, make sure to monitor both the cost and impact on queries on the table to
determine if it achieves a good price/performance ratio. If you're interested in learning more
about the automatic clustering service, check out this detailed post on the inner workings by
one of Snowflake's engineers.
Pros
The lowest effort way to cluster on a different key to the natural key.
Doesn't block or interfere with DML operations.
Cons
Unpredictable costs.
Snowflake takes a higher margin on automatic clustering than warehouse compute
costs, which can make automatic clustering less cost-effective than manual re-sorting.
3. Manual sorting
With fully recreated tables
If a table is always fully recreated as part of a transformation/modeling process, the table can
be perfectly clustered on any key by adding an order by statement to the create table as
(CTAS) query:
create or replace table my_table as ( -- my_table and my_cluster_key are placeholder names
with transformations as (
...
)
select *
from transformations
order by my_cluster_key
)
In this scenario of a table that is always fully recreated, we recommend always using manual
sorting over the automatic clustering service as the table will be well-clustered, and at a much
lower cost than the automatic clustering service.
On existing tables
Manually re-sorting an existing table on a particular key simply replaces the table with a
sorted version of itself. Let's suppose we have a sales table with entries for lots of different
stores, and most queries on the table filter for a specific store. We can ensure that the table is
well-clustered on store_id by recreating it as a copy of itself sorted on that column
(create or replace table sales as select * from sales order by store_id).
As new sales are added to the table over time, the existing micro-partitions will remain well-
clustered by store_id, but new micro-partitions will contain records for lots of different
stores. That means that older micro-partitions will prune well, but new micro-partitions won't.
Once performance decreases below acceptable levels, the manual re-sorting query can be run
again to ensure that all the micro-partitions are well-clustered on store_id.
The benefit of manual re-sorting over the automatic clustering service is complete control
over how frequently the table is re-clustered, and the associated spend. However, the danger
of this approach is that any DML operations which occur on the table while the create or
replace table operation is running will be undone. Manual re-sorting should only be used
on tables with predictable or pausable DML patterns, where you can be sure that no DML
operations will run while the re-sort is taking place.
Pros
Provides complete control over the clustering process.
Lowest cost way to achieve perfect clustering on any key.
Cons
Higher effort than the automatic clustering service. Requires the user to either
manually execute the sorting query or implement automated orchestration of the
sorting query.
Replacing an existing table with a sorted version of itself reverses any DML
operations which run during the re-sort.
Which clustering strategy should you use and when?
Always aim to leverage natural clustering as by definition it requires no re-clustering of a
table. Transformation processes that use incremental data processing to only process
new/updated data should always add an inserted_at or updated_at column for this
reason, as these will be naturally clustered and produce efficient pruning.
It’s common to see that most queries for an organization filter by the same columns, such
as region or store_id. If queries with common filtering patterns are causing full table scans,
then depending on how the table is populated, consider using automatic clustering or manual
re-sorting to cluster on the filtered column. If you’re not sure how you’d implement manual
re-sorting or there's a risk of DML operations running during the re-sort, use the automatic
clustering service.
Other good candidates for re-clustering are tables queried on a timestamp column which
doesn't always correlate to when the data was inserted, so natural clustering can't be used. An
example of this is an events table which is frequently queried on event_created_at or
similar, but events can arrive late and so micro-partitions have time range overlap. Re-
clustering the table on the event_created_at will ensure the queries prune well.
Regardless of the clustering approach chosen, it’s always a good idea to sort data by the
desired clustering key before inserting into the table.
Closing
Ultimately, pruning is achieved with complementary query design and table clustering. The
more data, the more powerful pruning is, with the potential to improve a query's performance
by orders of magnitude.
We’ll go deeper on the topic of clustering in future posts, including the use of
Snowflake’s system$clustering_information function to analyze clustering statistics.
We'll also explore options for when a table needs to be well-clustered on more than one
column, so be sure to subscribe to our mailing list below. Thanks for reading, and please get
in touch via Twitter or email where we'd be happy to answer questions or discuss these
topics in more detail.
Niall Woodward
Defining multiple cluster keys in Snowflake
with materialized views
Date
Sunday, November 20, 2022
Ian Whitestone
-- 1,500,000,000 records
select
o_orderdate,
o_orderkey,
o_custkey,
o_clerk
from snowflake_sample_data.tpch_sf1000.orders
A common scenario goes like this. The finance team regularly query specific
date ranges on this table in order to understand our sales volume. We also
have our engineering teams querying this table to investigate specific
orders. On top of that, marketing wants the ability to see all historical
orders for a given customer.
select
o_clerk
from snowflake_sample_data.tpch_sf1000.orders
)
select
o_orderdate,
count(*) as cnt
from orders
group by 1
Running a query against a range of dates, we can see from the query profile
that we are getting excellent query pruning. Only 22 out of 1609 micro-
partitions are being scanned.
Access Pattern 2: Query for a specific customer
select *
from orders
When we change our query to look up all orders for a particular customer,
the query pruning is ineffective with 99% of all micro-partitions being
scanned.
Access Pattern 3: Query for a specific order
select *
from orders
For our order-based lookup, on the third column of our cluster key, we see no
pruning whatsoever, with all micro-partitions being scanned in order to find
our single record.
Understanding the degraded performance of multi-column
cluster keys
As demonstrated above, the query pruning performance degrades
significantly for predicates (filters) on the second and third columns.
select
o_orderdate,
o_orderkey,
o_custkey,
o_clerk
from snowflake_sample_data.tpch_sf1000.orders
select
o_orderdate,
o_orderkey,
o_custkey,
o_clerk
from snowflake_sample_data.tpch_sf1000.orders
select
o_orderdate,
o_orderkey,
o_custkey,
o_clerk
from snowflake_sample_data.tpch_sf1000.orders
This approach has clear downsides. Users now have to keep track of three
different tables and remember which one to use for each query scenario,
which is not practical for a widely used table. You would also be
responsible for maintaining the three separate copies of this table in your
ETL/ELT pipelines.
We'll cover materialized views in more detail in a future post, but for now you can read the
Snowflake docs, which cover them in great detail. When you create a materialized view, like the one below,
Snowflake automatically maintains this derived dataset on your behalf.
When data is added or modified in the base table ( orders), Snowflake
automatically updates the materialized view.
select
o_orderdate,
count(*) as cnt
from orders
group by 1
Now, if anyone ever runs this query against the base table:
select
o_orderdate,
count(*) as cnt
from orders
group by 1
-- these will take some time to execute, since the entire dataset is processed
select
o_orderdate,
o_orderkey,
o_custkey,
o_clerk
from orders
;
create materialized view orders_clustered_by_order cluster by(o_orderkey) as (
select
o_orderdate,
o_orderkey,
o_custkey,
o_clerk
from orders
)
select
o_orderdate,
o_orderkey,
o_custkey,
o_clerk
from snowflake_sample_data.tpch_sf1000.orders
order by o_orderdate
)
Re-testing our three access patterns
Access Pattern 1: Query by date
select
o_orderdate,
count(*) as cnt
from orders
where
group by 1
select *
from orders
where
o_custkey=52671775
Look closely!
Note, we don’t have to re-write our query to explicitly tell Snowflake to
query the materialized view, it does this under the hood. Users don’t have
to remember which dataset to query under different scenarios!
select *
from orders
where
o_orderkey = 5019980134
We’ll provide more guidance on this in a future post, but for now we
recommend monitoring the maintenance costs³ and automatic clustering
costs⁴ associated with your materialized views. You can estimate your
storage costs upfront based on the table size and your storage rate.⁵
update orders
set o_clerk='Clerk#999999999' -- set clause reconstructed for illustration
where o_orderkey=5019980134
The query will do a full table scan on the base orders table and not use the
materialized view.
Summary
In this post we showed how materialized views can be leveraged to create
multiple versions of a table with different cluster keys. This practice can
help significantly improve query performance due to better pruning and
even lower the virtual warehouse costs associated with those queries. As
with anything in Snowflake, these benefits must be carefully considered
against their underlying costs.
In future posts, we’ll explore important topics like how to determine the
optimal cluster keys for your table, estimating the costs of automatic
clustering for a large table, and how to monitor clustering health and
implement more cost effective automatic clustering. We’ll also dive deeper
into defining multiple cluster keys on a single table and when it makes
sense to do so.
As always, don’t hesitate to reach out via Twitter or email where we'd be
happy to answer questions or discuss these topics in more detail. If you
want to get notified when we release a new post, be sure to sign up for our
Snowflake newsletter at the bottom of this page.
Notes
1
Notice how we order the clustering keys from lowest to highest
cardinality? From the Snowflake documentation on multi-column
cluster keys:
If you are defining a multi-column clustering key for a table, the order in
which the columns are specified in the CLUSTER BY clause is important. As a
general rule, Snowflake recommends ordering the columns
from lowest cardinality to highest cardinality. Putting a higher cardinality
column before a lower cardinality column will generally reduce the
effectiveness of clustering on the latter column.
A column’s cardinality is simply the number of distinct values. You can find
this out by running a query:
select
count(*), -- 1,500,000,000
count(distinct o_orderdate) as o_orderdate_cardinality,
count(distinct o_custkey) as o_custkey_cardinality,
count(distinct o_orderkey) as o_orderkey_cardinality
from public.orders
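To make the idea concrete, here is a tiny Python sketch of cardinality. The sample values are made up; in practice you would run the count(distinct ...) query above in Snowflake.

```python
# Cardinality = number of distinct values in a column.
# Hypothetical column samples, purely for illustration.
o_orderdate = ["1997-01-01", "1997-01-01", "1997-01-02"]
o_custkey = [101, 202, 303]

def cardinality(values):
    """Count distinct values, mirroring count(distinct col) in SQL."""
    return len(set(values))

print(cardinality(o_orderdate))  # 2
print(cardinality(o_custkey))    # 3
```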
2
You can only use materialized views if you are on the enterprise (or above)
edition of Snowflake. ↩
3
You can monitor the cost of your materialized view refreshes using the
following query:
select
table_name as materialized_view_name,
sum(credits_used) as num_credits_used
from snowflake.account_usage.materialized_view_refresh_history
group by 1
order by 1
4
You can monitor the cost of automatic clustering on your materialized
view using the following query:
select
tables.table_name as materialized_view_name,
sum(credits_used) as num_credits_used
from snowflake.account_usage.automatic_clustering_history
join snowflake.account_usage.tables
on automatic_clustering_history.table_id=tables.table_id
group by 1
order by 1
5
Most customers on AWS pay $23/TB/month. So if your base table is 10TB,
then each additional materialized view will cost $2,760 / year ( 10*23*12). ↩
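The arithmetic in this note can be sketched in Python. The $23/TB/month rate is the AWS figure quoted above; your actual rate may differ.

```python
# Hedged sketch: yearly storage cost of one additional materialized view,
# assuming $23/TB/month (the AWS on-demand figure from the post).
def mv_storage_cost_per_year(table_size_tb, usd_per_tb_month=23):
    return table_size_tb * usd_per_tb_month * 12

print(mv_storage_cost_per_year(10))  # 2760, i.e. $2,760/year for a 10TB base table
```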
Choosing the right warehouse size
in Snowflake
Date
Sunday, November 27, 2022
Niall Woodward
The days of complex and slow cluster resizing are behind us; Snowflake
makes it possible to spin up a new virtual warehouse or resize an existing
one in a matter of seconds. The implications of this are:
Before we look to answer that question, let's first understand what a virtual
warehouse is, and the impact of size on its available resources and query
processing speed.
While the nodes in each warehouse are physically separated, they operate
in harmony, and Snowflake can utilize all the nodes for a single query.
Consequently, we can work on the basis that each warehouse size increase
doubles the available compute cores, RAM, and disk space available.
Warehouse Size  Credits
X-Small         1
Small           2
Medium          4
Large           8
X-Large         16
2X-Large        32
3X-Large        64
4X-Large        128
5X-Large        256
6X-Large        512
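The doubling pattern in the table above is easy to capture programmatically; a small Python sketch:

```python
# Credits per hour double with each warehouse size step (per the table above).
sizes = ["X-Small", "Small", "Medium", "Large", "X-Large",
         "2X-Large", "3X-Large", "4X-Large", "5X-Large", "6X-Large"]

def credits_per_hour(size):
    """Credits billed per hour of runtime for a given warehouse size."""
    return 2 ** sizes.index(size)

print(credits_per_hour("X-Small"))   # 1
print(credits_per_hour("Medium"))    # 4
print(credits_per_hour("6X-Large"))  # 512
```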
Cost vs performance
CPU-bound queries will double in speed as the warehouse size increases,
up until the point at which they no longer fully utilize the warehouse’s
resources. Ignoring warehouse idle times from auto-suspend thresholds, a
query which runs twice as fast on a medium than a small warehouse will
cost the same amount to run, as cost = duration x credit usage rate . The
below graph illustrates this behavior, showing that at a certain point, the
execution time for bigger warehouses remains the same while the cost
increases. So, how do we find that sweet spot of maximum performance for
the lowest cost?
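To see why a twice-as-fast query on the next size up costs the same, here is a hedged Python sketch. The per-credit price is an assumed placeholder; actual rates depend on your Snowflake edition and contract.

```python
# cost = duration x credit usage rate.
# usd_per_credit=3 is an assumption for illustration only.
def query_cost(duration_hours, credits_per_hour, usd_per_credit=3):
    return duration_hours * credits_per_hour * usd_per_credit

# A CPU-bound query that runs 2x faster on a Medium (4 credits/hr)
# than a Small (2 credits/hr) costs exactly the same.
small = query_cost(duration_hours=1.0, credits_per_hour=2)
medium = query_cost(duration_hours=0.5, credits_per_hour=4)
assert small == medium
```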
Determining the best warehouse size for a
Snowflake query
Here’s the process we recommend:
A warehouse can run more than one query at a time, so where possible
keep warehouses fully loaded and even with light queueing for maximum
efficiency. Warehouses for non-user queries such as transformation
pipelines can often be run at greater efficiency due to the tolerance for
queueing.
1. The number of queries by execution time. Here you can see that over
98% of the queries running on this warehouse are taking less than 1
second to execute.
2. The number of queries by utilizable warehouse size. Utilizable
warehouse size represents the size of warehouse a query can fully
utilize. Where lots of queries don't utilize the warehouse's size, it
indicates that the warehouse is oversized or the queries should run
on a smaller warehouse. In this example, over 96% of queries being
run on the warehouse aren’t using all 8 nodes available in the Large
warehouse.
Using partitions scanned as a heuristic
Another helpful heuristic is to look at how many micro-partitions a query
is scanning, and then choose the warehouse size based off that. This
strategy comes from Scott Redding, a resident solutions architect at
Snowflake.
The intuition behind this strategy is that the number of threads available for
processing doubles with each warehouse size increase, and each thread can
process a single micro-partition at a time. You want to ensure that each
thread has plenty of work available (files to process) throughout the query
execution.
To interpret this chart, the goal is to aim for roughly 250 micro-partitions per
thread. If your query needs to scan 2000 micro-partitions, then running
the query on an X-Small will give each of its 8 threads 250 micro-partitions (files) to
process, which is ideal. Compare this with running the query on a 3XL
warehouse, which has 512 threads. Each of those threads will only get about 4
micro-partitions to process, which will likely result in many threads sitting
unused.
The main pitfall with this approach is that while micro-partitions scanned is
a significant factor in the query execution, other factors like query
complexity, exploding joins, and volume of data sorted will also impact the
required processing power.
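As a rough illustration of this heuristic in Python (the 8-threads-per-node figure is consistent with the 512-thread 3XL example above; the sizing function itself is our sketch, not Snowflake's algorithm):

```python
# Heuristic sketch: target ~250 micro-partitions per thread.
# Assumes 8 threads per node and node counts doubling with each size.
SIZES = ["X-Small", "Small", "Medium", "Large", "X-Large", "2X-Large", "3X-Large"]

def threads(size):
    return 8 * 2 ** SIZES.index(size)

def suggested_size(partitions_scanned, target_per_thread=250):
    # Pick the largest size whose threads still get ~target_per_thread partitions each.
    for size in reversed(SIZES):
        if partitions_scanned / threads(size) >= target_per_thread:
            return size
    return SIZES[0]

print(suggested_size(2_000))    # X-Small: 2000 / 8 threads = 250 per thread
print(suggested_size(500_000))  # 3X-Large: 500000 / 512 threads ~ 976 per thread
```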
Closing
Snowflake makes it easy to match workloads to warehouse configurations,
and we’ve seen queries more than double in speed while costing less
money by choosing the correct warehouse size. Increasing warehouse size
isn't the only option available to make a query run faster though, and many
queries can be made to run more efficiently by identifying and resolving
their bottlenecks. We'll provide a detailed guide on query optimization in a
future post, but if you haven't yet, check out our previous post on
clustering.
Ian Whitestone
The Snowflake Query Profile is the single best resource you have to
understand how Snowflake is executing your query and learn how to
improve it. In this post we cover important topics like how to interpret the
Query Profile and the things you should look for when diagnosing
poor query performance.
select
date_trunc('day', event_timestamp) as date,
count(*) as num_events
from events
group by 1
order by 1
1. TableScan: reads the records from the events table in remote storage, and
passes 1.3 million records through a link to the next operator.¹
Alternatively, you can navigate to the "Query History" page under the
"Activity" tab. For any query run in the last 14 days, you can click on it and
see the Query Profile.
If you already have the query_id handy, you can take advantage of
Snowflake's structured URLs by filling out this URL template:
Template: https://app.snowflake.com/<snowflake-region>/<account-
locator>/compute/history/queries/<paste-query-id-here>/profile
Filled out
example: https://app.snowflake.com/us-east4.gcp/xq35282/compute/history/
queries/01a8c0a5-0000-0b5e-0000-2dd500044a26/profile
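Filling out the template can be automated; a small Python sketch using the example region, account locator, and query ID from above:

```python
# Build a Query Profile deep link from the URL template above.
def query_profile_url(region, account_locator, query_id):
    return (f"https://app.snowflake.com/{region}/{account_locator}"
            f"/compute/history/queries/{query_id}/profile")

url = query_profile_url("us-east4.gcp", "xq35282",
                        "01a8c0a5-0000-0b5e-0000-2dd500044a26")
print(url)
```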
select
date_trunc('month', o_orderdate) as order_month,
count(*) as num_orders,
sum(o_totalprice) as total_order_value
from snowflake_sample_data.tpch_sf1000.orders
where
year(o_orderdate)=1997
group by order_month
order by order_month
The Query Profile also contains some useful statistics. To highlight a few:
1. An execution time summary. This shows what % of the total query
execution time was spent on different buckets. The 4 options listed
here include:
1. Processing: time spent on query processing operations like
joins, aggregations, filters, sorts, etc.
2. Local Disk I/O: time spent reading/writing data from/to local
SSD storage. This would include things like spilling to disk, or
reading cached data from local SSD.
3. Remote Disk I/O: time spent reading/writing data from/to
remote storage (i.e. S3 or Azure Blob storage). This would
include things like spilling to remote disk, or reading your
datasets.
4. Initialization: this is an overhead cost to start your query on the
warehouse. In our experience, it is always extremely small and
relatively constant.
2. Query statistics. Information like the number of partitions scanned
out of all possible partitions can be found here. Note that this is
across all tables in the query. Fewer partitions scanned means
the query is pruning well. If your warehouse doesn't have enough
memory to process your query and is spilling to disk, this information
will be reflected here.
3. Number of records shared between each node. This information is
very helpful to understand the volume of data being processed, and
how each node is reducing (or expanding) that number.
4. Percentage of total execution time spent on each node. Shown on
the top right of each node, it indicates the percentage of total
execution time spent on that operator. In this example, 83.2% of the
total execution time was spent on the TableScan operator. This
information is used to populate the "Most Expensive Nodes" list at
the top right of the Query Profile, which simply sorts the nodes by the
percentage of total execution time.
You may notice that the number of rows in/out of the Filter node are the
same, implying that the year(o_orderdate)=1997 SQL code did not accomplish
anything. The filter is eliminating records though, as this table contains 1.5
billion records. This is an unfortunate pitfall of the Query Profile; it does not
show the exact number of records being removed by a particular filter.
select
date_trunc('month', o_orderdate) as order_month,
count(*) as num_orders,
sum(o_totalprice) as total_order_value
from snowflake_sample_data.tpch_sf1000.orders
where
o_totalprice > (select avg(o_totalprice) from snowflake_sample_data.tpch_sf1000.orders)
group by order_month
order by order_month
Unlike above, the query plan now contains two steps. First, Snowflake
executes the subquery and calculates the average o_totalprice. The result is
stored, and used in the second step of the query which has the same 5
operators as our query above.
Complex Query
Here is a slightly more complex query with multiple CTEs:²
with
daily_shipments as (
select
l_shipdate,
sum(l_quantity) as num_items
from snowflake_sample_data.tpch_sf1000.lineitem
where
group by 1
),
daily_summary as (
select
o_orderdate,
count(*) as num_orders,
any_value(num_items) as num_items
from snowflake_sample_data.tpch_sf1000.orders
left join daily_shipments
on orders.o_orderdate=daily_shipments.l_shipdate
group by 1
),
summary_stats as (
select
min(num_items) as min_num_items,
max(num_items) as max_num_items
from daily_shipments
)
select
daily_summary.*,
summary_stats.*
from daily_summary
cross join summary_stats
A full mapping of the SQL code to the relevant operator nodes can be
found in the notes below.³
In future posts, we'll dive into each of these signals in more detail and share
strategies to resolve them.
Notes
1
Not all 1.3 million records are sent at once. Snowflake has a vectorized
execution engine. Data is processed in a pipelined fashion, with batches of
a few thousand rows in columnar format at a time. This is what allows an
XSMALL warehouse with 16GB of RAM to process datasets much larger
than 16GB. ↩
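The batching idea can be illustrated with a toy Python sketch. This is purely illustrative; Snowflake's actual engine is columnar and far more sophisticated.

```python
# Pipelined batch processing sketch: operators consume a few thousand
# rows at a time, so peak memory stays bounded regardless of table size.
def batches(rows, batch_size=4096):
    for i in range(0, len(rows), batch_size):
        yield rows[i:i + batch_size]

total = 0
for batch in batches(list(range(1_300_000))):
    total += len(batch)  # each operator sees one small batch at a time
print(total)  # 1300000: all rows processed without one giant buffer
```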
2
I wouldn't pay too much attention to what this query is calculating or the
way it's written. It was created solely for the purpose of yielding an
interesting example query profile. ↩
3
For readers interested in improving their ability to read Snowflake Query
Profiles, you can use the example query from above to see how each CTE
maps to the different sections of the Query Profile.
Ian Whitestone
select *
from table
where id = 5
Users often do this in order to investigate a particular record or subset of records. But what
happens when you want to exclude a specific column (like a large text column), or rename
one of them? Users are forced to type out all of the columns they want, along with any
desired renaming using the traditional AS syntax:
select
column_1,
column_2,
column_3 as column_3_renamed,
column_4,
column_5,
column_6,
column_7,
column_8,
column_12
from table
where id = 5
select
* exclude (column_10, column_11)
from table
where id = 5
select
* exclude (column_10)
from table
where id = 5
Brackets are optional when excluding a single column. The SQL above can be rewritten as:
select
* exclude column_10
from table
where id = 5
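For intuition, the effect of EXCLUDE resembles filtering keys out of a mapping; a rough Python analogy (the row and column names are made up):

```python
# Rough analogy of SELECT * EXCLUDE: keep every column except the excluded ones.
row = {"column_1": 1, "column_2": 2, "column_10": "big text", "column_11": "blob"}

def select_exclude(row, excluded):
    return {k: v for k, v in row.items() if k not in excluded}

print(select_exclude(row, {"column_10", "column_11"}))
# {'column_1': 1, 'column_2': 2}
```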
select *
from orders
join customers
on orders.customer_id=customers.customer_id
join items
on orders.order_id=items.order_id
In this example, we have two order_id columns and two customer_id columns in the
output, since they are present in multiple tables. We can easily exclude these by changing our
SQL to:
select
orders.*,
customers.* exclude (customer_id),
items.* exclude (order_id)
from orders
join customers
on orders.customer_id=customers.customer_id
join items
on orders.order_id=items.order_id
-- Snowflake
select
*
exclude(column_10, column_11)
from table
-- BigQuery/Databricks
select
* except (column_10, column_11)
from table
For anyone familiar with DuckDB, Snowflake follows the same EXCLUDE syntax that they
use. Both databases presumably chose to avoid using the EXCEPT keyword to provide this
functionality, since it is already used in set operations.
select
* rename (column_3 as column_3_renamed, column_5 as column_5_renamed)
where id = 5
This is much better than having to list out every field just to rename one or two:
-- old method 👎
select
column_1,
column_2,
column_3 as column_3_renamed,
column_4,
column_5 as column_5_renamed,
column_6,
...
column_12
from table
where id = 5
Similar to EXCLUDE, this can also be done for a single column, brackets optional:
select
* rename column_3 as column_3_renamed
where id = 5
select
* exclude (column_10, column_11) rename (column_3 as column_3_renamed)
from table
where id = 5
Much better than the code from the beginning of the blog post:
select
column_1,
column_2,
column_3 as column_3_renamed,
column_4,
column_5,
column_6,
column_7,
column_8,
column_12
from table
where id = 5
Notes
1
To give you a sense of demand, this stack overflow question, asked almost 14 years ago,
has 1.3 million views and over 1000 upvotes. ↩
2
Shoutout to Nate Sooter for motivating this example with his recent tweet! ↩
Ian Whitestone
Range joins and other types of non-equi joins are notoriously slow in most
databases. While Snowflake is blazing fast for most queries, it too suffers
from poor performance when processing these types of joins. In this post
we'll cover an optimization technique practitioners can use to speed up
queries involving a range join by up to 300x.¹
Before diving into the optimization technique, we'll cover some background
on the different types of joins and what makes range joins so slow in
Snowflake. Feel free to skip ahead if you're already familiar.
select
...
from orders
join customers
select distinct
o1.customer_id,
o2.customer_id,
o1.product_id
from orders_items as o1
join orders_items as o2
...
from orders
join customers
on orders.customer_id=customers.id
select
seconds.timestamp,
count(queries.query_id) as num_queries
from seconds
left join queries
on seconds.timestamp between queries.start_time and queries.end_time
group by 1
This join could also be based on derived timestamps. For example, find all
purchase events which occurred within 24 hours of users viewing the home
page:
select
...
from page_views
select
s1.session_id,
array_agg(s2.session_id) as concurrent_sessions
from landing_page_sessions as s1
group by 1
Let's use the "number of queries running per second" example from above
to explore this in more detail.
select
seconds.timestamp,
count(queries.query_id) as num_queries
from seconds
left join queries
on seconds.timestamp between queries.start_time and queries.end_time
group by 1
Our seconds table contains 1 row per second, and the queries table has 1 row
per query. The goal of this query is to lookup which queries were running
each second, then aggregate and count.
When executing the join, Snowflake first creates an intermediate dataset
that is the cartesian product of the two input datasets being joined. In this
example, the seconds table is 7 rows and the queries table is 4 rows, so the
intermediate dataset explodes to 28 rows. The range join condition that
performs the "point in interval" check happens after this intermediate
dataset is created, as a post-join filter. You can see a visualization of this
process in the image below (go here for a full screen, higher resolution
version).
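The same explosion can be reproduced with a toy Python sketch, using 7 seconds and 4 query intervals to match the figure described above (the interval values are made up):

```python
# Sketch of how the range join executes: cartesian product first,
# then the "point in interval" check applied as a post-join filter.
from itertools import product

seconds = list(range(1, 8))                  # 7 rows, one per second
queries = [(1, 3), (2, 5), (4, 6), (6, 7)]   # 4 rows: (start, end) per query

intermediate = list(product(seconds, queries))
print(len(intermediate))  # 28 = 7 x 4 rows before filtering

# Post-join filter: keep only pairs where the second falls in the interval.
matches = [(s, q) for s, q in intermediate if q[0] <= s <= q[1]]
```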
Running this query on a 30 day sample of data with 267K queries took 12
minutes and 30 seconds. As shown in the query profile, the join is the clear
bottleneck in this query. You can also see the range join condition
expressed as an "Additional Join Condition":
How to optimize range joins in Snowflake
When executing range joins, the bottleneck for Snowflake becomes the
volume of data produced in the intermediate dataset before the range join
condition is applied as a post-join filter. To accelerate these queries, we
need to find a way to minimize the size of the intermediate dataset. This
can be accomplished by adding an equi-join condition, which Snowflake
can process very quickly using a hash join.²
select
seconds.timestamp,
count(queries.query_id) as num_queries
from seconds
left join queries
on date_trunc('hour', seconds.timestamp)=date_trunc('hour', queries.start_time)
and seconds.timestamp between queries.start_time and queries.end_time
group by 1
Promising, but the approach falls apart when the interval (query total run
time) is greater than 1 hour. Because the equi-join is on the hour the query
started in, all records in any subsequent hours wouldn't be counted.
with
query_hours as (
select
queries.*,
hours_list.timestamp as query_hour
from queries
select
seconds.timestamp,
count(queries.query_id) as num_queries
from seconds
group by 1
You might have noticed that the query_hours CTE involves a range join itself -
won't that be slow? When applied for the right queries, the additional time
spent on the input dataset preparation will result in a much faster query
overall . Another concern may be that the query_hours dataset becomes
3
much larger than the original queries dataset, as it fans out to 1 row per
query per hour. Since most queries finish in well under 1 hour,
the query_hours dataset will be similar in size to the original queries dataset.
Adding the new equi-join condition on hours helps accelerate this range join
query by constraining the size of the intermediate dataset. However, this
approach is not ideal for a few reasons. Maybe hour isn't the best choice,
and something else should be used as a constraint. Additionally, how can
this approach be extended to support range joins involving other numeric
datatypes, like integers and floats?
You can see a visualization of this process in the image below (go here for
a full screen, higher resolution version).
Example binned range join query
Bin numbers are just integers that represent a range of data. One way to
create them is to divide the number by the desired bin size. With
timestamps, we can first convert the timestamp to unix time, which is an
integer, before dividing:
select
timestamp,
floor(date_part(epoch_second, timestamp) / $bin_size_s) as bin_num
from seconds
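The same calculation in a hedged Python sketch (the bin size is in seconds, and get_bin_number is our illustrative name for the helper):

```python
# Bin number sketch: convert the timestamp to unix time, then
# integer-divide by the bin size to get the bin number.
from datetime import datetime, timezone

def get_bin_number(ts, bin_size_s):
    return int(ts.timestamp() // bin_size_s)

t1 = datetime(2022, 1, 1, 0, 30, tzinfo=timezone.utc)
t2 = datetime(2022, 1, 1, 0, 59, tzinfo=timezone.utc)
# Timestamps within the same hour share a bin number when bin_size_s=3600.
assert get_bin_number(t1, 3600) == get_bin_number(t2, 3600)
```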
Following the steps described above, we first need to generate the list of
applicable bins. This is accomplished by using a generator to create a list of
integers, then filtering that list down to the desired start and end bin
numbers.⁶
with
metadata as (
select
min(timestamp) as start_time,
max(timestamp) as end_time
from seconds
),
-- we have to first generate a massive list, then filter down, since you can't pass calculated values into the generator
-- when bins_base is 1 trillion rows it takes 5 seconds to filter down; 106 ms for 1 million
bins_base as (
select
seq4() as row_num
from table(generator(rowcount => 1000000)) -- row count is illustrative; see the comment above
),
bins as (
select
bins_base.row_num as bin_num
from bins_base
),
Now we can add the bin number to each dataset. For the queries dataset,
we'll output a dataset with 1 row per query per bin that the query ran
within. For the seconds dataset, each timestamp will be mapped to a single bin.
queries_w_bin_number as (
select
start_time,
end_time,
warehouse_id,
cluster_number,
bins.bin_num
from queries
join bins
on bins.bin_num between
get_bin_number(queries.start_time, $bin_size_s)
and get_bin_number(queries.end_time, $bin_size_s)
),
seconds_w_bin_number as (
select
timestamp,
get_bin_number(timestamp, $bin_size_s) as bin_num
from seconds
)
And apply the final join condition, with the added equi-join condition
on bin_num:
select
s.timestamp,
count(q.warehouse_id) as num_queries
from seconds_w_bin_number as s
left join queries_w_bin_number as q
on s.bin_num=q.bin_num
and s.timestamp between q.start_time and q.end_time
group by 1
Using the same dataset as above, this query executed in 2.2 seconds,⁷
whereas the un-optimized version from earlier took 750 seconds. That's
over a 300x improvement. The query profile is shown below. Note how the
join condition now shows two sections: one for the equi-join condition
on bin_num, and another for the range join condition.
select
count(*) -- 267K
from queries
Rules of thumb are not perfect. If possible, test your query with a few
different bin sizes and see what performs best. Here's the performance
curve for the query above, using different bin sizes. In this case, picking the
99.9th percentile versus the 90th percentile didn't make much of a
difference. As expected, query times started to get worse once the bin size
got really small.
How to extend to a join with a fixed interval?
If you have a point in interval range join with a fixed interval size, like the
query shared earlier:
select
...
from page_views
Then set your bin size to the size of the interval: 24 hours.
select
s1.session_id,
array_agg(s2.session_id) as concurrent_sessions
from landing_page_sessions as s1
group by 1
You can apply the same binned range join technique after you've fanned
out both landing_page_sessions and app_sessions to contain 1 row per session
per bin the session fell within (as was done with queries above).
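The fan-out step can be sketched in Python (the interval values and bin size are made up):

```python
# Fan-out sketch: emit one row per interval per bin it overlaps,
# which is what enables the equi-join on bin_num.
def bins_for_interval(start_s, end_s, bin_size_s):
    """Return every bin number an interval (in unix seconds) touches."""
    return list(range(start_s // bin_size_s, end_s // bin_size_s + 1))

# A session from t=100s to t=7300s with 1h (3600s) bins spans 3 bins.
print(bins_for_interval(100, 7300, 3600))  # [0, 1, 2]
```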
The binned range join optimization technique only works for point in
interval and interval overlap range joins involving numeric types. It will not
work for other types of non-equi joins, although you can apply the same
principle of trying to add an equi-join constraint wherever possible to
reduce the row explosion.
If the dataset on the "right", with the start and end times, contains a
relatively flat distribution of interval sizes, then this technique won't be as
effective.
Notes
1
This stat is from a single query, so take it with a handful of salt. Your
mileage will vary depending on many factors. ↩
2
This approach was inspired by Simeon Pilgrim's post from 2016 (back
when Snowflake was snowflake.net!). I used it quite successfully until
implementing the more generic binning approach. ↩
3
The range join to the hours table will be much quicker than the range join
to the seconds table, since the intermediate table will be ~3600 times
smaller. ↩
4
This approach was inspired by Databricks. They don't go into the details
of how their algorithm is implemented, but I assume it works in a similar
fashion. ↩
5
Optionally create a get_bin_number function to avoid copying the same
calculation throughout the query:
create or replace function get_bin_number(ts timestamp, bin_size_s integer)
returns integer
as
$$
floor(date_part(epoch_second, ts) / bin_size_s)
$$
↩
6
Snowflake doesn't let you pass in calculated values to the generator, so
this had to be done in a two step process. In the near future we'll be open
sourcing some dbt macros to abstract this process away. ↩
7
The full example binned range join optimization query:
create or replace function get_bin_number(ts timestamp, bin_size_s integer)
returns integer
as
$$
floor(date_part(epoch_second, ts) / bin_size_s)
$$
;
with
metadata as (
select
min(timestamp) as start_time,
max(timestamp) as end_time
from seconds
),
bins_base as (
select
seq4() as row_num
),
bins as (
select
bins_base.row_num as bin_num
from bins_base
),
queries_w_bin_number as (
select
start_time,
end_time,
warehouse_id,
cluster_number,
bins.bin_num
from queries
join bins
on bins.bin_num between
get_bin_number(queries.start_time, $bin_size_s)
and get_bin_number(queries.end_time, $bin_size_s)
),
seconds_w_bin_number as (
select
timestamp,
get_bin_number(timestamp, $bin_size_s) as bin_num
from seconds
)
select
s.timestamp,
count(q.warehouse_id) as num_queries
from seconds_w_bin_number as s
left join queries_w_bin_number as q
on s.bin_num=q.bin_num
and s.timestamp between q.start_time and q.end_time
group by 1
;
↩
3 ways to configure Snowflake warehouse
sizes in dbt
Date
Wednesday, January 18, 2023
Niall Woodward
The ability to use different warehouse sizes for different workloads in Snowflake provides
enormous value for performance and cost optimization. dbt natively integrates with
Snowflake to allow specific warehouses to be chosen down to the model level. In this post,
we explain exactly how to use this feature and share some best practices.
The single most effective way of speeding up any query is to reduce the amount of data it
processes. If a model is becoming slow and uses a table materialization, consider the
possibility of using an incremental materialization to process only new or updated data each
time it runs.
If you’re already using incrementalization, or it’s not possible, then increasing the warehouse
size is likely the next best step for speeding up the model.
profiles.yml
In your profiles.yml file, edit the warehouse config to a different virtual warehouse. The
size of each warehouse is configured in Snowflake.
select_internal:
  outputs:
    dev:
      type: snowflake
      account: org.account
      user: niall
      password: XXXXX
      warehouse: dev
      database: dev
      schema: niall
      threads: 8
  target: dev
dbt Cloud
In dbt Cloud, navigate to Deploy > Environments in the top menu bar. Choose the
environment you want to edit, then Settings. Click Edit, then scroll down to Deployment
Connection where the warehouse can be changed. The size of the warehouse is configured in
Snowflake.
Changing the default dbt warehouse size isn’t necessarily wise, however, as most queries in
the project will not benefit from the increased warehouse size, leading to increased
Snowflake costs. For more details on the impact of warehouse size on query speed, see
our warehouse sizing post. Instead of increasing the default warehouse size, we recommend
setting the default warehouse size to X-Small, and overriding the warehouse size at the
individual model level as needed.
Hardcoded warehouse
dbt provides the snowflake_warehouse model configuration, which looks like this when set
in a specific model:
{{ config(
snowflake_warehouse="dbt_large"
) }}
select
...
from {{ ref('stg_orders') }}
name: my_project
version: 1.0.0

models:
  my_project:
    +snowflake_warehouse: 'dbt_xsmall'
    clickstream:
      +snowflake_warehouse: 'dbt_large'
{{ config(
snowflake_warehouse=get_warehouse('large')
) }}
select
...
from {{ ref('stg_orders') }}
This macro can implement logic to return the desired warehouse size for the environment.
{% macro get_warehouse(size) %}
{# branch conditions below are illustrative; adapt to your target names #}
{% if target.name == 'prod' %}
{% do return('dbt_production_' ~ size) %}
{% elif target.name == 'ci' %}
{% do return('dbt_ci_' ~ size) %}
{% else %}
{% do return(None) %}
{% endif %}
{% endmacro %}
Using a macro for the snowflake_warehouse config only works in model files, and cannot be
used in dbt_project.yml.
Conclusion
Thanks for reading! In an upcoming post we’ll share more recommendations for optimizing
dbt performance on Snowflake. Make sure to subscribe for notifications on future posts, and
feel free to reach out if you have any questions!
Ian Whitestone
Co-founder & CEO of SELECT
Snowflake query tags allow users to associate arbitrary metadata with each
query. In this post, we show how you can use query tags to achieve better
visibility & monitoring for your Snowflake query costs and performance.
Query tags enable more fine grained cost attribution. If you have a single
SQL statement, or series of SQL statements associated with a data model in
a pipeline, you can assign them the same query tag. Costs can then be
easily attributed to all queries associated with the given tag. The alternative
to this involves grouping by query_text, which does not allow for multiple
related SQL statements to be bucketed together. It also falls apart when the
SQL text for a given data model inevitably gets changed.
Query tags can also be used for more granular query
performance monitoring. Sticking with the example from earlier, users may
wish to monitor the total runtime for each data model by grouping
together the total elapsed time for all associated queries. Alternatively, if a
set of queries are used to power user facing application dashboards,
leveraging query tags can allow for more targeted performance monitoring.
Last of all, query tags provide the ability to link queries with metadata from
other systems. A query tag could contain a dashboard_id, which could enable
users to aggregate all costs for a single dashboard, and then see how often
that dashboard is used through the BI tool's metadata.
Every query issued by this user will now have this default tag.
select *
from raw_users
where
not deleted
);
from users_tmp
select *
from raw_orders
where
not deleted
);
con = snowflake.connector.connect(
user='XXXX',
password='XXXX',
account='XXXX',
session_parameters={
'QUERY_TAG': 'DATA_MODELLING_PIPELINE',
}
)
query = """
select *
from raw_users
where
not deleted
"""
query = """
select *
from raw_orders
where
not deleted
"""
con.cursor().execute(query) # tagged with 'orders_model'
1. They can be set once in your profiles.yml (source). All queries run in
your dbt project will then be tagged with that value.
2. Tags can be set for all models under a particular resource_path, or for a
single model, by adding a +query_tag in your dbt_project.yml. For
individual models, you can also specify the query tag in the model
config, i.e. {{ config(query_tag = 'XXX') }} . If a default query tag has
been set in profiles.yml, it will be overridden by any of these more
precise tags.
3. You can create a set_query_tag macro which automatically sets the
query tag to the model name for all models in your project.
import json

query_tag = {
    'app_name': 'pipeline',
    'model_name': 'users',
    'environment': 'prod',
    'version': 'v1.2',
    'trigger': 'schedule'
}
con.cursor().execute(f"alter session set query_tag = '{json.dumps(query_tag)}'")
con.cursor().execute(model_sql)
select
query_tag,
count(*) as num_executions,
avg(total_elapsed_time/1000) as avg_total_elapsed_time_s
from snowflake.account_usage.query_history
where
    query_tag is not null
group by 1
If the query_tag contains a JSON object, it can be parsed and segmented by
any of the keys. Using the example from above:
select
try_parse_json(query_tag)['model_name']::string as model_name,
count(*) as num_executions,
avg(total_elapsed_time/1000) as avg_total_elapsed_time_s
from snowflake.account_usage.query_history
where
try_parse_json(query_tag)['app_name']::string = 'pipeline'
group by 1
select
try_parse_json(query_tag)['model_name']::string as model_name,
count(*) as num_executions,
sum(query_cost) as total_cost,
avg(total_elapsed_time_s) as avg_total_elapsed_time_s
from query_history_enriched
where
try_parse_json(query_tag)['app_name']::string = 'pipeline'
group by 1
You'll also find that these queries run much quicker than a query against
Snowflake's account usage views, since the table is materialized and sorted
by start_time to achieve a well clustered state.
create table orders_tmp as (
    select *
    from raw_orders
    where
        not deleted
);
Where possible, we recommend using query tags since they are much
simpler to parse and analyze downstream. If it's possible for your query
metadata to exceed 2000 characters, stick with query comments.
1. Snowflake automatically removes any comments at the beginning of each
query, so you must append them to the end of the query.
2. The alter session statement itself is extremely fast, taking about 30ms on
average.
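The append-to-end requirement can be sketched in a few lines of Python. This is a minimal illustration, not from the post; the helper name and metadata keys are made up:

```python
import json

def append_query_comment(sql: str, metadata: dict) -> str:
    """Append a JSON metadata comment to the end of a SQL statement.

    Comments at the start of a query are stripped by Snowflake, so the
    comment must trail the statement text.
    """
    comment = json.dumps(metadata)
    return f"{sql.rstrip().rstrip(';')}\n/* {comment} */"

tagged = append_query_comment(
    "select * from raw_orders where not deleted",
    {"app_name": "pipeline", "model_name": "orders"},
)
print(tagged)
```

Downstream, the trailing comment can be extracted from query_text and parsed back into JSON.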
Monitoring dbt model spend and
performance with metadata
Date
Friday, February 24, 2023
Ian Whitestone
When using Snowflake and dbt, customers do not get these crucial
monitoring features out of the box. By adding metadata to their dbt models
through query tags or comments, customers can achieve these core
monitoring abilities.
Approach #3 is by far the best option as it avoids having users manually set
the tags. If you'd like to get started with dynamically setting query tags for
each model, you can implement a custom macro like the one here to add
detailed metadata to each query issued by dbt.
query-comment:
  append: true
"dbt_snowflake_query_tags_version": "2.0.0",
"app": "dbt",
"dbt_version": "1.4.0",
"project_name": "my_project",
"target_name": "dev",
"target_database": "dev",
"target_schema": "larry_goldings",
"invocation_id": "c784c7d0-5c3f-4765-805c-0a377fefcaa0",
"node_name": "orders",
"node_alias": "orders",
"node_package_name": "my_project",
"node_original_file_path": "models/staging/orders.sql",
"node_database": "dev",
"node_schema": "mart",
"node_id": "model.my_project.orders",
"node_resource_type": "model",
"materialized": "incremental",
"is_incremental": true,
"node_refs": [
"raw_orders",
"product_mapping"
],
"dbt_cloud_project_id": "146126",
"dbt_cloud_job_id": "184124",
"dbt_cloud_run_id": "107122910",
"dbt_cloud_run_reason_category": "other",
}
Using this info, you can monitor cost and performance by a variety of
interesting dimensions, such as dbt project, model name, environment (dev
or prod), materialization type, and more.
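For instance, the metadata payload can be sliced into grouping dimensions client-side. A small sketch, using an abridged version of the example payload above:

```python
import json

# Abridged query tag payload, based on the dbt-snowflake-query-tags example
raw_tag = """{
    "app": "dbt",
    "node_name": "orders",
    "materialized": "incremental",
    "is_incremental": true,
    "node_refs": ["raw_orders", "product_mapping"]
}"""

meta = json.loads(raw_tag)

# Pull out the dimensions you might group costs by
dimensions = {
    "model": meta["node_name"],
    "materialization": meta["materialized"],
}
print(dimensions)  # {'model': 'orders', 'materialization': 'incremental'}
```

In practice you would do the same parsing in SQL with try_parse_json, as shown in the queries that follow.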
select
    try_parse_json(query_tag)['model_name']::string as model_name,
    avg(total_elapsed_time/1000) as avg_total_elapsed_time_s,
    avg(execution_time/1000) as avg_execution_time_s
from snowflake.account_usage.query_history
where
    try_parse_json(query_tag)['app']::string = 'dbt'
group by 1
If you're using query comments, you'll first have to parse out the metadata
from the comment text:
with
query_history as (
    select
        *,
        try_parse_json(_dbt_json_meta) as dbt_metadata
    from snowflake.account_usage.query_history
)
select
    dbt_metadata['model_name']::string as model_name,
    avg(total_elapsed_time/1000) as avg_total_elapsed_time_s,
    ...
from query_history
where
    dbt_metadata['app']::string = 'dbt'
group by 1
Both queries above only look at the query run times. A number of other
metrics you can monitor in conjunction are:
select
    try_parse_json(query_tag)['model_name']::string as model_name,
    sum(query_cost) as total_cost
from query_history_w_costs
where
    try_parse_json(query_tag)['app']::string = 'dbt'
group by 1
If you'd like to automatically get a query history with the costs for each
query added, you can install our dbt-snowflake-monitoring package.
If you're looking to cut your Snowflake costs, want to get a better picture of
what's driving them, or just want to keep a pulse on things, you can get
access today or book a demo using the links below.
Should you use CTEs in Snowflake?
Date
Wednesday, March 15, 2023
Niall Woodward
CTEs are an extremely valuable tool for modularizing and reusing SQL logic.
They're also a frequent focus of optimization discussions, as their usage has
been associated with unexpected and sometimes inefficient query
execution. In this post, we dig into the impact of CTEs on query plans,
understand when they are safe to use, and when they may be best avoided.
Introduction
Much has been written about the impact of CTEs on performance in the
past few years:
But the fact we’re still seeing so much discussion shows we’ve not reached
a conclusion yet. This post aims to provide a reasoned set of guidelines for
when you should use CTEs, and when you might want to avoid them.
Snowflake's query optimizer is being continuously improved, and like in the
posts linked above, the behavior observed in this post will change over
time.
Let’s start with a recap of what CTEs are and why they’re popular.
with my_cte as (
    select 1
)
select *
from my_cte
with my_cte as (
select 1
),
my_cte_2 as (
select 2
)
select *
from my_cte
We can also put CTEs inside CTEs if we so wish (though things get a little
hard to read!):
with my_cte as (
    with my_inner_cte as (
        select 1
    )
    select *
    from my_inner_cte
)
select *
from my_cte
1. CTEs can help separate SQL logic into separate, isolated subqueries.
That makes debugging easier as you can simply select * from
cte to run a CTE in isolation.
2. CTEs provide a way of writing procedural-like SQL in a top-to-bottom
style, which can help with code review and maintainability.
3. CTEs can help conform to the DRY (don’t repeat yourself) principle,
providing a single place to define logic that is referenced multiple
times downstream.
with sample_data as (
    select *
    from snowflake_sample_data.tpch_sf1.customer
)
select *
from sample_data
where c_nationkey = 14
But if we reference that CTE more than once, we see something different,
and the query’s execution now differs from what it would be if we’d
referenced the table directly instead of using a CTE.
with sample_data as (
select *
from snowflake_sample_data.tpch_sf1.customer
),
nation_14_customers as (
select *
from sample_data
where c_nationkey = 14
),
nation_9_customers as (
select *
from sample_data
where c_nationkey = 9
)
select *
from nation_14_customers
union all
select *
from nation_9_customers
We see two new node types, the WithClause and the WithReference .
The WithClause represents an output stream and buffer from
the sample_data CTE we defined, which is then consumed by
each WithReference node. Note that Snowflake is intelligently ‘pushing
down’ the filter in the nation_14_customers and nation_9_customers CTEs
to the TableScan before the WithClause. Previously, Snowflake didn’t do
this, as reported in Dominik’s post. It’s worth checking that this behavior
applies to more complex queries, but for this query, the profile is the same
as if we’d written the query like:
with nation_14_customers as (
select *
from snowflake_sample_data.tpch_sf1.customer
where c_nationkey = 14
),
nation_9_customers as (
select *
from snowflake_sample_data.tpch_sf1.customer
where c_nationkey = 9
)
select *
from nation_14_customers
union all
select *
from nation_9_customers
The differences are:
Now we understand how CTEs are translated into an execution plan, let's
explore the performance implications.
Sometimes, it’s faster to repeat logic than re-use
a CTE
Most of the time, Snowflake’s strategy of computing a CTE's result once
and distributing the results downstream is the most performant strategy.
But in some circumstances, the cost of buffering and distributing the CTE
result to downstream nodes exceeds that of recomputing it, especially as
the TableScan nodes use cached results anyway.
with lineitems as (
    select *
    from snowflake_sample_data.tpch_sf100.lineitem
),
lineitem_future_sales as (
    select
        a.l_orderkey,
        a.l_linenumber,
        sum(b.l_quantity) as future_part_order_total
    from lineitems as a
    inner join lineitems as b
        on a.l_partkey = b.l_partkey
    group by 1, 2
)
select *
from lineitems
inner join lineitem_future_sales
    on lineitems.l_orderkey = lineitem_future_sales.l_orderkey
with lineitem_future_sales as (
    select
        a.l_orderkey,
        a.l_linenumber,
        sum(b.l_quantity) as future_part_order_total
    from (
        select *
        from snowflake_sample_data.tpch_sf100.lineitem
    ) as a
    inner join (
        select *
        from snowflake_sample_data.tpch_sf100.lineitem
    ) as b
        on a.l_partkey = b.l_partkey
    group by 1, 2
)
select *
from (
    select *
    from snowflake_sample_data.tpch_sf100.lineitem
) as lineitems
inner join lineitem_future_sales
    on lineitems.l_orderkey = lineitem_future_sales.l_orderkey
Recommendation
CTEs can be used with confidence in Snowflake, and a CTE that’s referenced
only once will never impact performance. Aside from some very specific
examples like the one above, computing a CTE once and re-using it yields
better performance than repeating the CTE’s logic. In the previous section,
we saw that Snowflake intelligently pushes down filters into CTEs to
avoid unnecessary full table scans.
Column pruning always works when a CTE is referenced once (when CTEs
are referenced only once they are treated as if they don’t exist). In a simple
case:
with sample_data as (
    select *
    from snowflake_sample_data.tpch_sf1.customer
)
select
    c_name,
    c_address
from sample_data
We can see that only the two columns that were selected were read from
the underlying table. But as we know from before, a CTE which is
referenced only once is a pass-through, and is compiled into a query plan
agnostic of its existence.
Column pruning stops working when a CTE is referenced
more than once
This time, let’s reference the CTE twice, selecting a single column in each
CTE reference.
with sample_data as (
select *
from snowflake_sample_data.tpch_sf1.customer
),
customer_names as (
select c_name
from sample_data
),
customer_addresses as (
select c_address
from sample_data
)
select c_name
from customer_names
union all
select c_address
from customer_addresses
with customer_names as (
select c_name
from snowflake_sample_data.tpch_sf1.customer
),
customer_addresses as (
select c_address
from snowflake_sample_data.tpch_sf1.customer
)
select c_name
from customer_names
union all
select c_address
from customer_addresses
As expected, we have two TableScan nodes and each retrieves only the
referenced columns.
with nations as (
select *
from snowflake_sample_data.tpch_sf1.nation
),
joined as (
    select *
    from snowflake_sample_data.tpch_sf1.customer
    inner join nations
        on customer.c_nationkey = nations.n_nationkey
)
select *
from joined
Ian Whitestone
create resource monitor my_monitor with
    credit_quota = 5
    frequency = daily
    start_timestamp = immediately
    triggers
        on 75 percent do notify;
The full syntax for this command can be found in the Snowflake
documentation.
First, you can only set resource monitors at the account or warehouse level.
Serverless features such as automatic clustering or materialized views can
also incur sudden charges but are not supported by resource monitors.
Resource monitors only support notifications via email. Most teams prefer
to receive notifications directly in Slack or Teams along with the rest of their
alerts and team communication.
Additionally, as shown in the screenshot earlier, the email notifications
contain little information about what caused the credit allowance to be
exceeded. Users must then dig through their account usage data to
determine what caused the spike.
Ian Whitestone
To set the query timeout to 1 hour for the current session, run the
command below:
alter session set statement_timeout_in_seconds = 3600;
To set the query timeout to 1 hour for a given user, run the command
below:
alter user my_user set statement_timeout_in_seconds = 3600;
It will likely be set to the default of 2 days (172,800 seconds). To drop this
down to 1 day, run the command below:
alter user my_user set statement_timeout_in_seconds = 86400;
To change the query timeout to 1 hour for the warehouse, run the
command below:
alter warehouse my_warehouse set
    warehouse_size = 'XSMALL'
    statement_timeout_in_seconds = 3600;
Similar to warehouses and sessions, you can check the current timeout
setting by running show parameters for task my_task . To change the task
timeout to 60 seconds, run alter task my_task set user_task_timeout_ms = 60000 .
Are you billed for queries and tasks cancelled by
Snowflake timeouts?
Yes. Snowflake charges customers for each second a virtual warehouse is
active. If a query runs on a virtual warehouse for 4 hours before it is
cancelled by Snowflake, customers will be charged for the 4 hours that
warehouse was active.
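To make that billing impact concrete, here's a back-of-the-envelope sketch. The 16 credits/hour figure for an X-Large comes from the warehouse sizing table later in this collection; the per-credit dollar price is an assumption, since on-demand rates vary by edition and region:

```python
# Cost of a query that ran 4 hours on an X-Large warehouse before being
# cancelled by a statement timeout. An X-Large consumes 16 credits/hour.
CREDITS_PER_HOUR_XL = 16
price_per_credit = 3.00  # assumed on-demand $/credit; varies by edition/region

hours_active = 4
credits_consumed = hours_active * CREDITS_PER_HOUR_XL
cost = credits_consumed * price_per_credit
print(credits_consumed, cost)  # 64 192.0
```

The warehouse may also have served other queries during those hours, but the credits themselves are consumed regardless of whether the query ultimately succeeds.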
Identifying unused tables in Snowflake
Date
Monday, March 20, 2023
Ian Whitestone
Removing unused datasets can be a quick win for teams looking to reduce
their Snowflake spend. It can also improve security and reduce risks
associated with data breaches and data exposure. The less data you store,
the smaller the footprint for unintended access.
Lastly, deleting unused tables can improve overall data warehouse usability.
Unused datasets often contain data that is stale or not meant to be
accessed, so removing these tables can help avoid confusion or reporting
errors.
In this post we’ll cover how to identify unused tables in Snowflake using
the access_history account usage view.
Skip to the final SQL?
If you want to skip ahead and see the final SQL implementation, you can
head straight to the end!
create view orders_view as (
    select *
    from orders
    where
        not test
        and success
);
The query select * from orders_view directly accesses the orders_view object,
and indirectly accesses the base orders table.
Correspondingly, orders_view will appear in the direct_objects_accessed column
of access_history, whereas orders will appear in base_objects_accessed .
When it comes to deciding if a table is unused, it’s important to
use base_objects_accessed since this will account for queries that indirectly
access a table through a view.
Parsing base_objects_accessed
base_objects_accessed is a JSON array of all base data objects accessed
during query execution. Here’s an example of the column’s contents from
the documentation:
"columns": [
"columnId": 68610,
"columnName": "CONTENT"
],
"objectDomain": "Table",
"objectId": 66564,
"objectName": "GOVERNANCE.TABLES.T1"
The array of objects accessed by each query can be transformed to one row
per object via lateral flatten and then filtered to only consider table
objects, as shown below:
with
access_history as (
    select *
    from snowflake.account_usage.access_history
),
access_history_flattened as (
    select
        access_history.query_id,
        access_history.query_start_time,
        access_history.user_name,
        objects_accessed.value:objectId::integer as table_id,
        objects_accessed.value:objectName::text as object_name,
        objects_accessed.value:objectDomain::text as object_domain,
        objects_accessed.value:columns as columns_array
    from access_history,
        lateral flatten(access_history.base_objects_accessed) as objects_accessed
),
table_access_history as (
    select
        query_id,
        query_start_time,
        user_name,
        object_name as fully_qualified_table_name
    from access_history_flattened
    where
        object_domain = 'Table'
)
select *
from table_access_history
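The flatten-and-filter step is easy to mirror in plain Python against the documented base_objects_accessed payload. This is illustrative only (the view-type second object is invented for contrast); in practice you'd run the SQL in Snowflake:

```python
import json

# One query's base_objects_accessed payload; the second, view-type object
# is a made-up addition to show the filtering
base_objects_accessed = json.loads("""
[
  {
    "columns": [{"columnId": 68610, "columnName": "CONTENT"}],
    "objectDomain": "Table",
    "objectId": 66564,
    "objectName": "GOVERNANCE.TABLES.T1"
  },
  {
    "objectDomain": "View",
    "objectId": 66565,
    "objectName": "GOVERNANCE.VIEWS.V1"
  }
]
""")

# Flatten to one record per object, keeping only table objects
tables_accessed = [
    obj["objectName"]
    for obj in base_objects_accessed
    if obj["objectDomain"] == "Table"
]
print(tables_accessed)  # ['GOVERNANCE.TABLES.T1']
```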
with
access_history as (
    select *
    from snowflake.account_usage.access_history
),
access_history_flattened as (
    select
        access_history.query_id,
        access_history.query_start_time,
        access_history.user_name,
        objects_accessed.value:objectId::integer as table_id,
        objects_accessed.value:objectName::text as object_name,
        objects_accessed.value:objectDomain::text as object_domain,
        objects_accessed.value:columns as columns_array
    from access_history,
        lateral flatten(access_history.base_objects_accessed) as objects_accessed
),
table_access_history as (
    select
        query_id,
        query_start_time,
        user_name,
        object_name as fully_qualified_table_name
    from access_history_flattened
    where
        object_domain = 'Table'
)
select
    fully_qualified_table_name,
    max(query_start_time) as last_accessed_at
from table_access_history
group by 1
select
id as table_id,
total_storage_tb*12*23 as annualized_storage_cost
from snowflake.account_usage.table_storage_metrics
where
not deleted
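The annualization in that formula is just $23/TB/month (typical AWS US region pricing) multiplied by 12 months. In Python:

```python
# Annualized storage cost mirroring the SQL above: $23 per TB per month
# on typical AWS US regions, times 12 months
DOLLARS_PER_TB_PER_MONTH = 23

def annualized_storage_cost(total_storage_tb: float) -> float:
    return total_storage_tb * 12 * DOLLARS_PER_TB_PER_MONTH

print(annualized_storage_cost(2.0))  # 552.0
```

Swap in your region's per-TB rate if you're outside US AWS regions, since storage pricing varies by cloud and region.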
Identify all tables not queried in the last X days
So far we’ve covered how to determine when a table was last
accessed and the storage costs associated with each table. We can tie
these building blocks together to identify all tables not queried in the last
90 days and show the annual savings that could be expected if the tables
were deleted.
with
access_history as (
    select *
    from snowflake.account_usage.access_history
),
access_history_flattened as (
    select
        access_history.query_id,
        access_history.query_start_time,
        access_history.user_name,
        objects_accessed.value:objectId::integer as table_id,
        objects_accessed.value:objectName::text as object_name,
        objects_accessed.value:objectDomain::text as object_domain,
        objects_accessed.value:columns as columns_array
    from access_history,
        lateral flatten(access_history.base_objects_accessed) as objects_accessed
),
table_access_history as (
    select
        query_id,
        query_start_time,
        user_name,
        object_name as fully_qualified_table_name,
        table_id
    from access_history_flattened
    where
        object_domain = 'Table'
),
table_access_summary as (
    select
        table_id,
        max(query_start_time) as last_accessed_at
    from table_access_history
    group by 1
),
table_storage_metrics as (
    select
        id as table_id,
        total_storage_tb*12*23 as annualized_storage_cost
    from snowflake.account_usage.table_storage_metrics
    where
        not deleted
)
select
    table_storage_metrics.*,
    table_access_summary.last_accessed_at
from table_storage_metrics
left join table_access_summary
    on table_storage_metrics.table_id = table_access_summary.table_id
where
    last_accessed_at is null
    or last_accessed_at < dateadd('day', -90, current_timestamp())
with
table_access_summary as (
    select
        table_id,
        max(query_start_time) as last_accessed_at
    from query_base_table_access
    group by 1
),
table_storage_metrics as (
    select
        id as table_id,
        total_storage_tb*12*23 as annualized_storage_cost
    from snowflake.account_usage.table_storage_metrics
    where
        not deleted
)
select
    table_storage_metrics.*,
    table_access_summary.last_accessed_at
from table_storage_metrics
left join table_access_summary
    on table_storage_metrics.table_id = table_access_summary.table_id
where
    last_accessed_at is null
    or last_accessed_at < dateadd('day', -90, current_timestamp())
select
    table_id,
    table_catalog||'.'||table_schema||'.'||table_name as fully_qualified_table_name,
    last_altered as last_altered_at
from snowflake.account_usage.tables
where
    deleted is null
Wrapping Up
Removing unused tables represents one of the many cost saving
opportunities available to Snowflake users. In addition to surfacing table
access patterns, SELECT automatically produces a variety of other
optimization recommendations. Get access today or book a demo using the
links below.
Snowflake Pricing Explained | 2024 Billing
Model Guide
Date
Wednesday, March 6, 2024
Niall Woodward
Co-founder & CTO of SELECT
Ian Whitestone
Snowflake Overview
Snowflake is a data cloud platform used by organizations to store, process
and analyse data. Snowflake uses the big three cloud providers for hosting
- AWS (Amazon Web Services), GCP (Google Cloud Platform), and Microsoft
Azure. Snowflake is a fully-managed platform, and users have no direct
access to the underlying infrastructure. This is owed to Snowflake’s goal of
making the platform straightforward to use by managing complexity while
still providing a powerful feature set.
Snowflake Editions
You can think of Snowflake Editions as different plans Snowflake offers.
Snowflake Regions
An individual Snowflake account runs in a single region. Customers are free
to create as many accounts as they wish across both cloud providers and
regions. Snowflake has over 35 regions to choose from. Customers typically
choose the same region and cloud provider as their existing infrastructure
for their primary account.
1. Snowflake edition
2. Hosting region
3. Cloud provider
4. Discounts
Here is the range of per-credit pricing across each edition, using on-demand
payment terms, where usage is invoiced every month. The lower value of
each range corresponds to the typical US AWS regions used by most
customers, with the upper values applying to regions outside of the USA.
For a comprehensive table of Snowflake credit costs across all clouds and
regions, see the Snowflake Credit Consumption Table.
Warehouse Size   Credits / Hour   Snowpark-Optimized Credits / Hour
X-Small          1                N/A
Small            2                N/A
Medium           4                6
Large            8                12
X-Large          16               24
2X-Large         32               48
3X-Large         64               96
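The doubling pattern in the standard credit column is easy to encode; a small sketch:

```python
# Credits per hour double with each warehouse size, starting at 1 for X-Small
SIZES = ["X-Small", "Small", "Medium", "Large", "X-Large", "2X-Large", "3X-Large"]

def credits_per_hour(size: str) -> int:
    return 2 ** SIZES.index(size)

print(credits_per_hour("X-Large"))   # 16
print(credits_per_hour("3X-Large"))  # 64
```

Each size up therefore doubles the hourly cost, though it also roughly doubles the available compute.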
Serverless Pricing
For Snowflake’s serverless features like Snowpipe, Automatic
Clustering and Serverless tasks, credits are consumed using a multiplier
specific to each feature. The cheapest services from a credits per hour
perspective are Query Acceleration and Snowpipe Streaming, which both
incur a cost of 1 compute credit per hour. The most expensive feature is the
Search Optimization Service, which incurs a cost of 10 compute credits per
hour.
Feature              Compute Credits per Hour   Cloud Services Credits per Hour
Clustered tables     2                          1
Logging              1.25                       1
Query acceleration   1                          1
Replication          2                          1
Storage Pricing
Snowflake stores data in a proprietary file format called micro-partitions in
the cloud storage service of the underlying cloud provider (i.e. Amazon S3,
Azure Blob Storage, Google Cloud Storage).
Snowflake uses direct dollar pricing for storage. Prices again vary
depending on cloud provider and region. Customers on AWS USA regions
pay $23 per TB per month. Regions outside of the USA again are more
expensive. A comprehensive breakdown of all storage costs across all
regions can be found on the Snowflake website.
If you're looking for a deep dive into Snowflake storage costs, be sure to
check out our post here. For quick wins on reducing unnecessary storage
costs, you can query your account to identify any unused tables and
remove them.
Service          Credits
Compute          100
Cloud Services   5
Total            100

Service          Credits
Compute          100
Cloud Services   15
Total            105

Only 10 credits are rebated, calculated as 10% of the compute credits used.
Therefore, the customer is charged for 5 cloud services credits.
Most customers never pay for cloud services due to this 10% policy.
Scenarios where this isn’t the case typically involve a large number of
simple queries being executed, as these have a high cloud services cost
relative to their compute costs.
SPCS runs on top of Compute Pools, which are different than virtual
warehouses. The credits per hour for each type of compute are shown
below:
INSTANCE_FAMILY       vCPU   Memory (GiB)   Storage (GiB)   GPU             GPU Memory per GPU (GiB)   Max. Limit
CPU - XS              2      8              250             n/a             n/a                        50
CPU - S               4      16             250             n/a             n/a                        50
CPU - M               8      32             250             n/a             n/a                        20
CPU - L               32     128            250             n/a             n/a                        20
High-Memory CPU - S   8      64             250             n/a             n/a                        20
GPU - M               48     192            250             4 NVIDIA A10G   24                         5
Is Snowflake expensive?
There is a widespread rhetoric that Snowflake is expensive, and if managed
improperly, it can be. Customers not choosing the right warehouse
size or creating too many warehouses without proper controls in place can
often be a culprit of runaway costs. However, all usage-based cloud
platforms get expensive when not used thoughtfully, this isn’t unique to
Snowflake. When using the right processes, monitoring and management,
Snowflake can be a very cost-effective choice for a cloud data platform.
At SELECT, we've built our entire data platform on top of Snowflake due to
its ease of use, scalability, and it's cost-effectiveness.