Snowflake's performance capabilities are unparalleled, but to truly maximize performance and
minimize costs, Snowflake performance tuning and optimization are essential. While Snowflake
can handle massive data volumes and complex queries with ease, it does not mean it always
should.
In this article, we explored several Snowflake features and best practices to boost query
performance. To summarize:
Monitor and reduce queueing to minimize wait times. Queueing is often the culprit
behind sluggish queries.
Leverage result caching to reuse results and slash compute time.
Tame row explosion with techniques like optimizing data types and data volumes,
optimizing subqueries, making use of the DISTINCT clause, avoiding Cartesian products,
and making use of temporary tables.
Optimize query pruning to scan less data and improve query speed.
Manage and monitor disk spillage to avoid performance impacts.
While this covers many essential Snowflake performance tuning techniques, it only scratches the
surface of Snowflake's capabilities. This concludes Part 3 of our article series on Snowflake
performance tuning. If you missed the previous installments, be sure to check out Part 1 and Part
2 for a comprehensive understanding of the topic.
FAQs
What are some common Snowflake Performance Tuning techniques?
Common Snowflake Performance Tuning techniques include query optimization, data clustering,
result caching, monitoring disk spillage and using the right Snowflake warehouse size for your
workload.
How does query optimization improve Snowflake performance?
Query optimization in Snowflake can significantly improve performance by reducing the amount
of data scanned during a query, thus reducing the time and resources required to execute the query.
What is the role of data clustering in Snowflake Performance Tuning?
Data clustering in Snowflake helps to minimize the amount of data scanned during a query,
which can significantly improve query performance.
How does warehouse size affect Snowflake performance?
The size of the Snowflake warehouse can greatly impact performance. Larger warehouses can
process queries faster, but they also consume more credits. Therefore, it's important to choose
the right size for your specific needs.
Can caching improve Snowflake performance?
Yes, caching can significantly improve Snowflake performance. Snowflake automatically caches
data and query results, which can speed up query execution times.
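Result caching is on by default; as a quick sketch, it can be toggled per session with the USE_CACHED_RESULT parameter:
ALTER SESSION SET USE_CACHED_RESULT = TRUE;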
How do you check Snowflake performance?
There are several ways to check your Snowflake performance, including monitoring query
execution times, analyzing query plans, and reviewing resource usage. You can also use the
Snowflake web interface (Snowsight), Snowflake's built-in performance monitoring tools, such
as the Query Profile and Query History views, the ACCOUNT_USAGE schema, as well as
third-party tools like Chaos Genius, to gain insights into your system's performance and identify
areas for improvement.
What makes Snowflake fast?
Snowflake's architecture is designed to be highly scalable and performant. It uses a unique
separation of compute and storage, allowing for independent scaling of each component. Also,
Snowflake uses a columnar data format and advanced compression techniques to minimize data
movement and optimize query performance.
How do you handle long-running queries in Snowflake?
To handle long-running queries in Snowflake, you can use the 'QUERY_HISTORY' function to
identify them. Once identified, you can either manually cancel the query using the
'SYSTEM$CANCEL_QUERY' function, or optimize the query for better performance.
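As a rough sketch (the 5-minute threshold and the placeholder query ID are assumptions), you could find and cancel such queries like this:
-- find queries that have been running for more than 5 minutes
SELECT query_id, user_name, total_elapsed_time/1000 AS elapsed_seconds
FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY())
WHERE execution_status = 'RUNNING'
AND total_elapsed_time > 5 * 60 * 1000;
-- cancel a specific query by its ID
SELECT SYSTEM$CANCEL_QUERY('<query_id>');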
What is Snowflake stage?
A Snowflake stage is a location where data can be stored within the Snowflake data warehouse.
It can be thought of as a folder or directory within the Snowflake environment where data files
in various formats (such as CSV, JSON, or Parquet) can be stored and accessed by Snowflake
users.
What is the difference between Snowflake stage and External Tables?
A Snowflake stage is a storage location for data files, whereas an external table is a virtual table
that points to data stored outside Snowflake. The key difference is that a stage loads data into
Snowflake, while an external table enables querying of data located external to Snowflake.
What is the difference between external table and regular table in Snowflake?
External tables reference external data sources. Regular tables store data natively within
Snowflake.
What is the difference between Snowpipe and Snowflake external table?
Snowpipe in Snowflake is an automated data ingestion service that continuously loads data from
external sources into Snowflake tables. On the other hand, an external table is a virtual table that
references data stored outside of Snowflake.
What are the supported Snowflake file formats?
Snowflake supports a variety of file formats, including CSV, JSON, Avro, Parquet, ORC, and
XML.
How do you use a file format in Snowflake?
Create it with CREATE FILE FORMAT, specifying the format type, compression, encoding,
etc. Use it when loading/unloading data.
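For example, a minimal sketch (the format, stage, and table names are illustrative):
CREATE OR REPLACE FILE FORMAT my_csv_format
TYPE = 'CSV'
FIELD_DELIMITER = ','
SKIP_HEADER = 1
COMPRESSION = 'GZIP';
-- reference the named file format when loading data from a stage
COPY INTO my_table
FROM @my_stage/data/
FILE_FORMAT = (FORMAT_NAME = 'my_csv_format');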
What is default Snowflake file format?
The default Snowflake file format is CSV. This means that if you do not specify a file format
when loading or unloading data, Snowflake will use the CSV format.
What is the best Snowflake file format?
It depends on the use case: Parquet/ORC for structured data, JSON/Avro for semi-structured
data, and CSV for simple use cases. Consider size, performance, and compression.
How do I get a list of file formats in Snowflake?
Use SHOW FILE FORMATS to display the names, types, and properties of existing file formats.
Snowflake Query Tagging by User/Account/Session
You can set Snowflake query tags at three different levels:
Account
User
Session
When Snowflake query tags are set at multiple levels, they are applied in a specific order of
precedence. So at the account level, a query tag is set for the entire Snowflake account. If a user-
level query tag is set for a specific user in the Snowflake account, it will override the account-
level query tag for any queries that the user runs.
In the same way, if a user-level or account-level query tag is already in place and a session-level
query tag is set, the session-level query tag will take priority over the user-level, or account-
level tag for any queries run during that session.
Hence, Snowflake query tags can be set at multiple levels, but the order of precedence
determines which tag is applied for any given query.
Here's an in-depth explanation of each level:
Account-Level Query Tag:
An account-level query tag is a tag that applies to all queries run in the account, no matter who
runs them. This means that if an account admin sets an account-level query tag, it will be
automatically applied to every query run by any user in that account.
Note: Only an ACCOUNTADMIN can set an account-level query tag.
To assign an account-level query tag in Snowflake, use the ACCOUNTADMIN role to
set the tag with the ALTER ACCOUNT command.
Here's an example SQL query:
USE ROLE ACCOUNTADMIN;
ALTER ACCOUNT SET QUERY_TAG = 'Account_level_query_tag';
FAQs
How are query tags different from object tags?
Query tags and object tags serve different purposes in Snowflake. Query tags are used for
tagging SQL statements to enable better cost and performance monitoring. Object tags, on the
other hand, are used for persistent account objects like users, roles, tables, views, and functions.
Why should I use query tags in Snowflake?
Query tags provide enhanced cost attribution and performance monitoring. They allow for fine-
grained cost attribution by associating costs with specific queries or sets of related queries.
Can query tags be used with dbt (data build tool)?
Yes, query tags can be used with dbt.
Can query comments be used instead of query tags?
Yes, query comments can be used to add metadata to queries, but they have some limitations.
Query tags are simpler to parse and analyze downstream, while query comments may require
additional processing. Query comments also have a size limit of 1MB, whereas query tags are
limited to 2000 characters.
How do you set a query tag in Snowflake?
To set a query tag in Snowflake, you can use the "ALTER USER" command with the SET
QUERY_TAG parameter. This allows you to assign a tag to specific users.
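As a quick sketch (the user name and tag values are illustrative), the user- and session-level equivalents of the account-level example above look like this:
ALTER USER john_doe SET QUERY_TAG = 'User_level_query_tag';
ALTER SESSION SET QUERY_TAG = 'Session_level_query_tag';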
5 Tips for Snowflake query optimization using Snowflake Query Profile
Understand the Snowflake Operators
To increase the efficiency of Snowflake queries, it is important to understand the roles each
operator plays in the query pipeline and how different operators interact with one another. So, if
you understand certain patterns and look at the results of each step, you can find places to
improve to make the Snowflake query optimization process faster and more efficient.
Identify Performance Issues
Use the Query Profile to see which steps of the query pipeline are consuming the most time and
concentrate your Snowflake query optimization efforts there. After you've identified potential
performance bottlenecks, look into why some operators are taking longer than expected or using
too many resources. And once these issues have been identified, you can take corrective action
to increase query efficiency and minimize overall runtime.
Search for long-running stages
When analyzing the query profile, always pay attention to stages that require
more processing time, as these are often the most likely candidates for slowdowns. So
by pinpointing the stages that need improvement, you can then focus your Snowflake query
optimization efforts on them and work towards improving overall Snowflake query performance.
"Query Details" to identify bottlenecks
"Query Details" section of the query profile provides additional information about the data
involved in the query, including the number of rows and bytes processed at each stage. Use these
stats to identify bottlenecks in the query. For instance, if you see that a particular stage is
processing a large number of rows, you may want to consider optimizing the query to reduce the
amount of data processed.
FAQs
What information does Snowflake Query Profile provide?
Snowflake Query Profile provides information such as query runtime stats, query execution
stats, query resource usage, and query error/warning messages.
How do you profile a query in Snowflake?
Snowflake Query Profile can be accessed through Snowsight or Classic UI by navigating to the
Activity or History sections and selecting the desired query. Also, you can use the
GET_QUERY_OPERATOR_STATS function to retrieve operator-level query statistics programmatically.
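For instance, a minimal sketch of pulling operator-level stats for the most recent query in the session:
SELECT * FROM TABLE(GET_QUERY_OPERATOR_STATS(LAST_QUERY_ID()));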
How can I analyze a query using Snowflake Query Profile?
Snowflake Query Profile offers a Query Graph View and Query Stats Panel. The Query Graph
View visualizes the data pipeline flow, while the Query Stats Panel provides performance
metrics and information about each operator in the pipeline.
When should I use the Snowflake Query Profile?
The Query Profile should be used when you need more diagnostic information about a query,
such as understanding its performance or identifying bottlenecks.
What are some things to look for in the Snowflake Query Profile?
Some indicators of poor query performance in the Query Profile include high spillage to remote
disk, a large number of partitions scanned, exploding joins, Cartesian joins, operators blocked by
a single CTE, unnecessary early sorting, repeated computation of the same view, and a very
large Query Profile with numerous nodes.
What is the purpose of the Snowflake query profile?
The Snowflake query profile provides execution details for a query, offering a graphical
representation of the processing plan's components, along with statistics for each component and
the overall query.
What is an example of a Query_tag in Snowflake?
Query tags in Snowflake are session-level parameters that can have defaults set at the account
and user level. For example, you can set a default query tag like '{"DevTeam": "software",
"user": "John"}', and every query issued by that user will have this default tag.
How do you write an efficient Snowflake query?
To write efficient queries in Snowflake, you should focus on scan reduction (limiting the volume
of data read), query rewriting (reorganizing a query to reduce cost), and join optimization
(optimally executing joins).
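As a small illustration of scan reduction (the table and column names are assumptions), selecting only the needed columns and filtering early lets Snowflake prune micro-partitions:
SELECT order_id, order_date, amount
FROM orders
WHERE order_date >= '2024-01-01';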
How do you write efficient Snowflake queries?
Follow these best practices:
Use Snowflake Query Profile tool.
Choose the right-sized virtual warehouse.
Maximize caching.
Leverage materialized views.
Optimize data clustering and micro-partitioning.
Utilize Snowflake Query Acceleration.
Consider using the Search Optimization Service.
Snowflake Storage Costs 101 - An In-Depth Guide to Manage and Optimize Storage Costs
In this article, we will provide valuable insights and techniques for reducing Snowflake storage
costs. We'll take a closer look at how Snowflake storage costs work—and point out specific key
areas where you can clean it up. This article is suitable for both new and experienced Snowflake
users, covering key areas for Snowflake cost optimization and best practices. By the end of this
article, you'll have a better understanding of Snowflake storage costs and the tools & techniques
needed for improving your Snowflake ROI.
How Do Snowflake Storage Costs Work?
Before we do a deep dive into the intricacies of understanding and optimizing Snowflake storage
costs, let's first understand its "multi-cluster, shared data" design architecture - separating
storage, compute, and cloud services. This design allows Snowflake to logically integrate these
components while physically separating them. It differs from traditional data warehouses, where
storage, compute, and cloud services are tightly coupled. Let's delve further into the design of
a multi-cluster, shared data system, focusing on its three components:
Storage: This is the persistent storage layer for data stored in Snowflake. It resides in a
scalable cloud storage service (such as GCS or S3) that ensures data replication, scaling,
and availability without customer management. Snowflake optimizes and stores data in a
columnar format within this storage layer, organized into databases as specified by the
user.
Compute: This is a collection of independent compute resources that execute data
processing tasks required for queries.
Cloud Services: This is a collection of system services that handle infrastructure,
security, metadata, and optimization across the entire Snowflake platform.
Filter by Tag
The Usage dashboard in your organization allows you to filter storage usage by a specific tag/value
combination. This feature is designed to enable you to attribute Snowflake costs to specific
logical units within your organization. Using tags, you can easily identify and track the storage
usage for different departments, projects, or any other logical units you have defined in your
organization. This feature is similar to filtering credit consumption by tag, enabling you to track
and attribute costs to specific units within your organization. In both cases, using tags allows for
easy tracking and cost attribution, making it easy to understand and manage your organization's
storage and credit usage.
Filter by Tag
View Snowflake Storage by Type or Object
The bar graph on the Usage dashboard of your organization allows you to filter Snowflake
storage usage data either By type or By object.
Table size
Query for Table Size in Snowflake
It is possible to gain insights into tables, including their size, by writing SQL queries instead of
using the Snowsight web interface, because this gives users more flexibility and automation
when working with large amounts of data [3]. A user with the proper access/privileges can list
data about tables using the SHOW TABLES command.
For example, the following query will list all tables in a specific schema:
SHOW TABLES IN SCHEMA your_schema;
Querying Data for Table Size
Let's go a bit further. Now that we know what the SHOW TABLES command does, it's time to
dive into actually calculating the size of an individual table. Users with
the ACCOUNTADMIN role can use SQL to view table size information by executing queries
against the TABLE_STORAGE_METRICS view in the ACCOUNT_USAGE schema. This view
shows the table-level storage utilization info, which can be utilized to compute the storage
billing for every table in the account, including those that have been removed but are still
consuming the storage fees. TABLE_STORAGE_METRICS contains storage information about
all tables that a particular account holds and tons of information about table size, including the
number of bytes currently being stored, the number of bytes that have been stored historically—
and more. For a quick example, here is a sample query that will calculate and return the total
table size in GB for the table DATASETS which is present
in REAL_ESTATE_DATA_ATLAS database:
USE ROLE ACCOUNTADMIN;
SELECT TABLE_NAME,
TABLE_SCHEMA,
SUM(((active_bytes + time_travel_bytes + failsafe_bytes +
retained_for_clone_bytes)/1024)/1024)/1024 AS TOTAL_STORAGE_USAGE_IN_GB
FROM REAL_ESTATE_DATA_ATLAS.INFORMATION_SCHEMA.TABLE_STORAGE_METRICS
WHERE TABLE_NAME IN ('DATASETS')
GROUP BY 1,2
ORDER BY 1,2,3;
Note: To query the TABLE_STORAGE_METRICS view, the user must
have ACCOUNTADMIN privilege and there might be 1-2 hour delay in updating the
storage related stats for active_bytes + time_travel_bytes + failsafe_bytes +
retained_for_clone_bytes in this view.
Query for Snowflake Storage Costs: ORGANIZATION_USAGE and ACCOUNT_USAGE Schemas
So before we even start to query the data for Snowflake storage cost, we need to understand that
Snowflake provides two schemas that make it possible to query storage
cost: ORGANIZATION_USAGE and ACCOUNT_USAGE. These schemas contain data related to
usage and cost and provide very detailed and granular, analytics-ready usage data to build
custom reports or even dashboards.
ORGANIZATION_USAGE schema contains data about usage and costs at the organization
level, which includes data about storage usage and cost for all the databases and stages within an
organization in a shared database named SNOWFLAKE. Only the users with
the ORGADMIN or ACCOUNTADMIN role can access the ORGANIZATION_USAGE schema.
For more information on the ORGANIZATION_USAGE views, please refer to this.
ACCOUNT_USAGE schema contains data about usage and costs at the account level, which
includes data about storage usage and cost for all the tables within an account.
For more information on the ACCOUNT_USAGE views, please refer to this.
Most views in these two schemas contain the cost of storage in terms of storage size. But, if you
want to view the cost in currency instead of size, you can write queries using
the USAGE_IN_CURRENCY_DAILY View. This view converts the storage size into actual
currency using the daily price of a TB allowing users to easily understand the actual cost of
storage in terms of currency.
SELECT
DATE_TRUNC('day', USAGE_DATE) AS "Usage Date",
SUM(USAGE) AS "Total Usage",
SUM(USAGE_IN_CURRENCY) AS "Total Cost"
FROM
ORGANIZATION_USAGE.USAGE_IN_CURRENCY_DAILY
WHERE
ORGANIZATION_NAME = 'YourOrganizationName'
AND USAGE_TYPE = 'YourUsageType'
AND USAGE_DATE BETWEEN 'start_date' AND 'end_date'
AND CURRENCY = 'CurrencyOfTheUsage'
GROUP BY
1
ORDER BY
1 DESC
LIMIT
10;
This particular query returns the total usage and total cost for a specific organization, usage type,
and date range, grouped by day and ordered by decreasing usage date, with a limit of 10 rows.
Snowflake Storage Cost Reduction Techniques for Maximizing Snowflake ROI
Let's talk about some ways to keep Snowflake storage costs as low as possible. We've already
talked about how storage, compute, and services are the three main things that make up
Snowflake costs. Snowflake's pricing is based on how much you use it, so you only pay for what
you use. It's important to keep an eye on certain areas to reduce storage costs, as you
may be burning up space unnecessarily. Here are some important places to look in order to
find ways to clean up and cut down on storage costs.
1. Staged Files: Staged files for bulk data loading and unloading consume storage space.
Frequently check the STORAGE_USAGE view in ACCOUNT_USAGE, which gives an idea
of how much space is consumed by stages (see the sketch after this list).
2. Unused Tables: Use Snowflake's ACCESS_HISTORY view to find how frequently
objects are queried and drop those that haven't been used in a certain period of time.
3. Time travel Costs: Time travel is a unique feature of Snowflake that retains deleted or
updated data for a specified number of days, but it is important to note that this stored
history data incurs the same cost as active data. Hence, always be careful, make sure
to check it, and consider the data criticality and table churn rate when setting up time-
travel retention periods.
4. Transient Tables: Consider having a transient database/schema for environments where
data has low importance, such as staging environments. This will help control storage
costs and avoid unnecessary data retention.
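Here is the sketch referenced in item 1 (the 30-day window is an assumption); it pulls daily stage storage from the ACCOUNT_USAGE.STORAGE_USAGE view:
SELECT usage_date,
stage_bytes / POWER(1024, 3) AS stage_storage_gb
FROM snowflake.account_usage.storage_usage
WHERE usage_date >= DATEADD(day, -30, CURRENT_DATE())
ORDER BY usage_date DESC;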
Conclusion
As with any cloud-based service, you should always keep a close eye on your
Snowflake storage costs to make sure you're not spending too much on resources you don't
actually need. By understanding the different storage costs and using tools and techniques like
Snowsight (Snowflake web user interface) and custom queries, users can get valuable
information about the storage costs. Using the tips and tricks outlined in this article, you can
keep your Snowflake storage costs to a minimum and improve your Snowflake ROI.
FAQs
How is storage cost calculated in Snowflake?
Snowflake storage costs are based on a flat rate per terabyte of consumption (after compression)
and vary depending on the account type (capacity or on demand), Cloud Platform, and region.
What is staged file storage cost in Snowflake?
Staged file storage cost refers to the cost associated with storing files for bulk data loading or
unloading in Snowflake.
How can I monitor Snowflake storage costs?
Snowflake provides tools like Snowsight to monitor storage usage at the account-wide and
individual table levels. Users can also query the ACCOUNT_USAGE and
ORGANIZATION_USAGE schemas to gain detailed information about storage costs.
How much does Snowflake storage cost for 1 TB?
The cost of storing 1 TB of data in Snowflake varies based on factors such as account type,
Cloud Platform, and region. For Capacity storage in the United States, it's around ~$23/month,
while in the EU, it's approximately ~$24.50/month. With On Demand Pricing, it's ~$40/month in
the US and ~$45/month in the EU.
Snowflake Data Transfer Costs 101 - An In-Depth Guide to Manage and Optimize Data Transfer
Costs
Snowflake, a popular cloud data platform, provides users with superior scalability, low latency,
advanced analytics and flexible pay-per-use pricing—but managing its data transfer costs can be
very daunting and hectic for businesses. To help get the most value out of Snowflake, it's
important to understand how Snowflake data transfer costs work, including both data ingress and
egress costs.
In this article, we'll explain Snowflake's data transfer costs, including data ingress and egress
fees, how they vary depending on cloud providers and regions, and how to optimize Snowflake
costs and maximize ROI.
Understanding Snowflake Data Transfer Costs
Snowflake calculates data transfer costs by considering criteria such as data size, transfer rate,
and the region or cloud provider from where the data is being transferred. Before tackling how
to reduce these costs, it is important to understand how Snowflake calculates them.
Snowflake Data Ingress and Egress Costs
Snowflake data transfer cost structure depends on two factors: ingress and egress.
Ingress, meaning transferring the data into Snowflake, is free of charge. Data egress, however,
or sending the data from Snowflake to another region or cloud platform incurs a certain fee. If
you transfer your data within the same region, it's usually free. When transferring externally
via external tables, functions, or Data Lake Exports, though, there are per-byte charges
associated that can vary among different cloud providers.
Take a look at some Snowflake pricing tables for comparison to get an idea of what the
difference in cost could look like:
Snowflake Data Transfer Costs in AWS:
How to Query Data for Snowflake Data Transfer Cost?
Before we begin querying the data for Snowflake data transfer costs, we must first understand
that Snowflake has two schemas for querying data transfer
costs: ORGANIZATION_USAGE and ACCOUNT_USAGE. These schemas contain data related
to usage and cost and provide very detailed and granular, analytics-ready usage data to build
custom reports or even dashboards.
ORGANIZATION_USAGE schema contains data about usage and costs at the
organization level, including storage usage and cost data for all organizational databases
and stages in a shared database named SNOWFLAKE. The ORGANIZATION_USAGE
schema is only accessible to users with the ORGADMIN or ACCOUNTADMIN roles.
ACCOUNT_USAGE schema contains usage and cost data at the account level, including
storage consumption and cost for all tables inside an account.
Most views in the ORGANIZATION_USAGE and ACCOUNT_USAGE schemas show how
much it costs to send data based on how much data is transferred. To view cost in currency
rather than volume, write queries against the USAGE_IN_CURRENCY_DAILY View. This view
transforms the amount of data transferred into a currency cost based on the daily price of
transferring a TB.
Several ACCOUNT_USAGE and ORGANIZATION_USAGE views contain usage and data transfer cost
information for transfers from your Snowflake account to another region or cloud provider. These
views can be used to gain insight into the costs associated with data transfer.
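For example, a minimal sketch against the ACCOUNT_USAGE.DATA_TRANSFER_HISTORY view (the one-month window is an assumption) that breaks down outbound bytes by target cloud and region:
SELECT target_cloud,
target_region,
transfer_type,
SUM(bytes_transferred) / POWER(1024, 3) AS gb_transferred
FROM snowflake.account_usage.data_transfer_history
WHERE start_time >= DATEADD(month, -1, CURRENT_TIMESTAMP())
GROUP BY 1, 2, 3
ORDER BY 4 DESC;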
FAQs
What factors influence the cost of Snowflake data transfer?
Snowflake data transfer cost structure depends on two factors: ingress and egress.
Is data ingress free in Snowflake?
Yes, transferring data into Snowflake (data ingress) is generally free of charge.
Does Snowflake charge for data egress?
Yes, there are fees associated with data egress in Snowflake. Sending data from Snowflake to
another region or cloud platform incurs a certain fee.
Will I incur data transfer costs if I move data within the same region?
No, there are no data transfer costs when moving data within the same region in Snowflake.
Can I dispute an unexpected spike in charges?
Yes, you can contact Snowflake support to review and dispute any unexpected data transfer
charges on your Snowflake bill.
Snowflake Compute Costs 101 - An In-Depth Guide to Manage & Optimize Compute Costs
Snowflake, the leading cloud data platform widely used by businesses, is celebrated for its
unique pricing model and innovative architecture. Users are only charged for the compute
resources, storage and data transfer they use, enabling them to save huge on Snowflake costs.
However, this could lead to cost escalations if usage is not monitored carefully. To avoid such
issues and get maximum effectiveness out of Snowflake pricing, it is crucial to understand
Snowflake compute costs—the cost of using virtual warehouses, serverless computing and other
cloud services in Snowflake. These costs depend on the amount of Snowflake compute
resources used and the duration of their use; proper optimization can help businesses cut down
on their spending without sacrificing Snowflake performance. This article provides
actionable tips and real-world examples to help you better understand Snowflake compute costs
and maximize your Snowflake ROI.
Understanding Snowflake Compute Costs
Snowflake compute costs are easy to comprehend, but before we even dive into that topic, we
need to get a decent understanding of what a Snowflake credit means.
Snowflake Credit Pricing
Snowflake credits are the units of measure used to pay for the consumption of resources within
the Snowflake platform. Essentially, they represent the cost of using the various compute
options available within Snowflake, such as Virtual Warehouse Compute, Serverless Compute,
and Cloud Services Compute. If you are interested in learning more about Snowflake pricing and
Snowflake credits pricing, you can check out this article for a more in-depth explanation.
Snowflake Virtual Warehouse Compute
Virtual warehouse in Snowflake is a cluster of computing resources used to execute queries, load
data and perform DML (Data Manipulation Language) operations. Unlike traditional on-
premise databases, which rely entirely on a fixed MPP (Massively Parallel
Processing) server, Snowflake's virtual warehouse is a dynamic group of virtual database servers
consisting of CPU cores, memory and SSD that are maintained in a hardware pool and can be
deployed quickly and effortlessly [1].
Snowflake credits are used to measure the processing time consumed by each individual virtual
warehouse based on the number of warehouses used, their runtime, and their actual size.
There are mainly two types of virtual warehouses:
1. Standard Virtual Warehouse
2. Snowpark-optimized Virtual Warehouse
Standard Virtual Warehouse: Standard warehouses in Snowflake are ideal for most data
analytics applications, providing the necessary Snowflake compute resources to run efficiently.
The key benefit of using standard warehouses is their flexibility—they can be started and
stopped at any moment as per your needs, without any restrictions. Also, standard warehouses
can be resized on the fly, even while they are currently under operation, to adjust to the varying
resource requirements of your data analytics applications, meaning that you can dynamically
allocate more or fewer compute resources to your virtual warehouse, depending on the type of
operations it is currently performing. In our upcoming article, we'll go into detail about how a
virtual warehouse works, including what Snowflake pricing is, how much a virtual warehouse
costs, and what its benefits and best practices are.
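As a quick sketch of that on-the-fly resizing (the warehouse name is an assumption):
ALTER WAREHOUSE analytics_wh SET WAREHOUSE_SIZE = 'LARGE';
-- suspend it when it is not needed to stop credit consumption
ALTER WAREHOUSE analytics_wh SUSPEND;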
The various sizes available for Standard virtual warehouse are listed below, indicating the
amount of compute resources per cluster. As you move up the size ladder, the hourly credit cost
for running the warehouse gets doubled.
Snowflake Warehouse Size (Source: docs.snowflake.com)
Snowpark-optimized Virtual Warehouse: A Snowpark-optimized virtual warehouse provides 16x
memory per node compared to a standard Snowflake virtual warehouse and is recommended for
processes with high memory requirements, such as machine learning training datasets on a single
virtual warehouse node[3].
Creating a Snowpark-optimized virtual warehouse in Snowflake is just as straightforward as
creating a standard virtual warehouse. You can do this by writing or copying the SQL script
provided below. The script creates a SNOWPARK_OPT_MEDIUM virtual warehouse of medium
size, "SNOWPARK-OPTIMIZED" warehouse type, which is optimized for Snowpark operations.
create or replace warehouse SNOWPARK_OPT_MEDIUM with
warehouse_size = 'MEDIUM'
warehouse_type = 'SNOWPARK-OPTIMIZED';
Serverless Feature | Snowflake Credits per Compute Hour
Automatic Clustering | 2
External tables | 2
Materialized views | 10
Replication | 2
Snowpipe | 1.25
Snowflake Credit Consumption by Type, Service, or Resource
When analyzing the compute history through the bar graph display, you can filter the data based
on three criteria:
By Type
By Service
By Resource.
Each filter provides a unique perspective on resource consumption and helps you get a more in-
depth understanding of the Snowflake pricing.
By Type allows you to separate the resource consumption into two categories:
Compute (including virtual warehouses and serverless resources)
Cloud services
This filter provides a clear distinction between the two types of compute resources.
Filter By Type
By Service separates resource consumption into warehouse consumption and consumption by
each serverless feature. Cloud services compute is included in the warehouse consumption.
Filter By Service
By Resource, on the other hand, separates resource consumption based on the Snowflake object
that consumed the credits.
Filter By Resource
Using SQL Queries for Viewing Snowflake Compute Costs
Snowflake offers two powerful
schemas: ORGANIZATION_USAGE and ACCOUNT_USAGE. These schemas contain rich
information about Snowflake credit usage and Snowflake pricing. They provide granular usage
data, ready for analytics, and can be used to create custom reports and dashboards.
The majority of the views in these schemas present the cost of compute resources in terms of
credits consumed. But suppose you're interested in exploring the cost in currency rather than
credits. In that case, you can write queries using the USAGE_IN_CURRENCY_DAILY View,
which converts credits consumed into cost in currency using the daily price of credit.
The following views provide important insights into the cost of compute resources and
help you make data-driven decisions on resource usage and cost control. For more thorough
information on compute costs, refer to the relevant ACCOUNT_USAGE and
ORGANIZATION_USAGE views, some of which are queried below.
-- by hour
select date_part('HOUR', start_time) as start_hour, warehouse_name,
avg(credits_used_compute) as credits_used_compute_avg
from snowflake.account_usage.warehouse_metering_history
where start_time >= dateadd(day, -30, current_timestamp())
and warehouse_id > 0
group by 1, 2
order by 1, 2;
from "SNOWFLAKE"."ACCOUNT_USAGE"."AUTOMATIC_CLUSTERING_HISTORY"
from "SNOWFLAKE"."ACCOUNT_USAGE"."SEARCH_OPTIMIZATION_HISTORY"
where start_time >= dateadd(month,-1,current_timestamp())
group by 1,2,3,4
order by 5 desc;
Search Optimization History
This query calculates the average daily credits consumed by Search Optimization per week over
the past year: it first groups the credits used by day, then averages those daily credits by week,
and orders the results by the date the week starts.
with credits_by_day as (
select to_date(start_time) as date
,sum(credits_used) as credits_used
from "SNOWFLAKE"."ACCOUNT_USAGE"."SEARCH_OPTIMIZATION_HISTORY"
where start_time >= dateadd(year,-1,current_timestamp())
group by 1
order by 2 desc
)
select date_trunc('week',date)
,avg(credits_used) as avg_daily_credits
from credits_by_day
group by 1
order by 1;
Calculating Snowflake Compute Costs: Materialized Views
Materialized Views cost history (by day, by object)
This query generates a complete report of Materialized Views and the number of credits utilized
by the service in the past 30 days, which is then sorted by day.
select
to_date(start_time) as date
,database_name
,schema_name
,table_name
,sum(credits_used) as credits_used
from "SNOWFLAKE"."ACCOUNT_USAGE"."MATERIALIZED_VIEW_REFRESH_HISTORY"
from "SNOWFLAKE"."ACCOUNT_USAGE"."MATERIALIZED_VIEW_REFRESH_HISTORY"
where start_time >= dateadd(year,-1,current_timestamp())
group by 1
order by 2 desc
)
select date_trunc('week',date)
,avg(credits_used) as avg_daily_credits
from credits_by_day
group by 1
order by 1;
Calculating Snowflake Compute Costs: Snowpipe
Snowpipe cost history (by day, by object)
This query provides a full list of pipes and the volume of credits consumed via the service over
the last 1 month, which is then sorted by day.
select
to_date(start_time) as date
,pipe_name
,sum(credits_used) as credits_used
from "SNOWFLAKE"."ACCOUNT_USAGE"."PIPE_USAGE_HISTORY"
where start_time >= dateadd(month,-1,current_timestamp())
group by 1,2
order by 3 desc;
Calculating Snowflake Compute Costs: Replication
Replication cost history
This query provides a full list of replicated databases and the volume of credits consumed via
the replication service over the last 1 month (30 days) period.
select
to_date(start_time) as date
,database_name
,sum(credits_used) as credits_used
from "SNOWFLAKE"."ACCOUNT_USAGE"."REPLICATION_USAGE_HISTORY"
where start_time >= dateadd(month,-1,current_timestamp())
group by 1,2
order by 3 desc;
Replication History & m-day average
This query provides the average daily credits consumed by Replication grouped by week over
the last 1 year.
with credits_by_day as (
select to_date(start_time) as date
,sum(credits_used) as credits_used
from "SNOWFLAKE"."ACCOUNT_USAGE"."REPLICATION_USAGE_HISTORY"
where start_time >= dateadd(year,-1,current_timestamp())
group by 1
order by 2 desc
)
select date_trunc('week',date)
,avg(credits_used) as avg_daily_credits
from credits_by_day
group by 1
order by 1;
Calculating Snowflake Compute Costs: Cloud Services
Warehouses with high cloud services usage
This query shows the warehouses that are not using enough warehouse time to cover the cloud
services portion of compute.
select
warehouse_name
,sum(credits_used) as credits_used
,sum(credits_used_cloud_services) as credits_used_cloud_services
,sum(credits_used_cloud_services)/sum(credits_used) as
percent_cloud_services
from "SNOWFLAKE"."ACCOUNT_USAGE"."WAREHOUSE_METERING_HISTORY"
where to_date(start_time) >= dateadd(month,-1,current_timestamp())
and credits_used_cloud_services > 0
group by 1
order by 4 desc;
FAQs
What is a virtual warehouse in Snowflake?
Virtual warehouse in Snowflake is a cluster of computing resources used to execute queries, load
data and perform DML(Data Manipulation Language) operations. It consists of CPU cores,
memory, and SSD and can be quickly deployed and resized as needed.
What is the pricing model for Snowflake virtual warehouses?
Snowflake virtual warehouses are priced based on the number of warehouses used, their runtime,
and their actual size. The pricing varies for different sizes of virtual warehouses.
How are serverless compute costs calculated in Snowflake?
Serverless compute costs in Snowflake are calculated based on the total usage of the specific
serverless feature. Each serverless feature consumes a certain number of Snowflake credits per
compute hour.
What are some serverless features in Snowflake?
Serverless features in Snowflake include automatic clustering, external tables, materialized
views, query acceleration service, search optimization service, Snowpipe, and serverless tasks.
What is Snowflake Cloud Services Compute?
Snowflake Cloud Services Compute refers to a combination of services that manage various
tasks within Snowflake, such as user authentication, query optimization, and metadata
management.
How does Snowflake pricing work for Cloud Services?
Cloud service usage is paid for with Snowflake credits, and you only pay if your daily cloud
service usage exceeds 10% of your total virtual warehouse usage.
Can I view Snowflake compute costs in currency instead of credits?
Yes, Snowflake provides the USAGE_IN_CURRENCY_DAILY view to convert credits
consumed into cost in currency using the daily price of credits.
Is Snowflake compute cost expensive?
Snowflake compute cost can be expensive, but it depends on your specific workloads. If you
only need a small amount of compute power, Snowflake can be a cost-effective option.
However, if you need a lot of compute power, Snowflake can be more expensive than traditional
on-premises data warehouses.
Snowflake Zero Copy Clone 101—An Essential Guide (2024)
Snowflake zero copy clone is an incredibly useful and advanced feature that allows users to
clone a database, schema, or table quickly and easily without any additional Snowflake storage
costs. What's more, Snowflake zero copy clone takes only a few minutes to complete—
depending on the size of the source object—without the complex manual configuration often
required in conventional databases. This article covers all you need to know about
Snowflake zero copy clone.
Let's dive in!
What is Snowflake zero copy clone?
Snowflake zero copy clone, often referred to as "cloning", is a feature in Snowflake that
effectively creates an exact copy of a database, table, or schema without consuming extra
storage space, taking up additional time, or duplicating any physical data. Instead, a logical
reference to the source object is created, allowing for independent modifications to both the
original and cloned objects. Snowflake zero copy cloning is fast and offers you maximum
flexibility with no additional Snowflake storage costs associated with it.
Use-cases of Snowflake zero copy clone
Snowflake zero copy clone provides users with substantial flexibility and freedom, with use
cases like:
To quickly perform backups of Tables, Schemas, and Databases.
To create a free sandbox to enable parallel use cases.
To enable quick object rollback capability.
To create various environments (e.g., Development, Testing, Staging, etc.).
To test possible modifications or developments without creating a new environment.
Snowflake zero copy clone provides businesses with smarter, faster, and more flexible data
management capabilities.
How does Snowflake zero copy clone work?
The Snowflake zero copy clone feature allows users to clone a database object without making a
copy of the data. This is possible because of the Snowflake micro-partitions feature, which
divides all table data into small chunks that each contain between 50 and 500 MB of
uncompressed data. However, the actual size of the data stored in Snowflake is smaller because
the data is always stored compressed. When cloning a database object, Snowflake simply creates
new metadata entries pointing to the micro-partitions of the original source object, rather than
copying it for storage. This process does not involve any user intervention and does not
duplicate the data itself—that's why it's called "zero copy clone".
To gain a better understanding, let's deep dive even further.
To illustrate this, consider a database table, EMPLOYEE table, and its cloned
snapshot, EMPLOYEE_CLONE, in a Snowflake database. The metadata layer in Snowflake
connects the metadata of EMPLOYEE to the micro-partitions in the storage layer where the
actual data resides. When the EMPLOYEE_CLONE table is created, it generates a new metadata
set pointing to the same micro-partitions storing the data for EMPLOYEE. Essentially, the
clone EMPLOYEE_CLONE table is a new metadata layer for EMPLOYEE rather than a physical
copy of the data. The beauty of this approach is that it enables us to create clones of tables
quickly without duplicating the actual data, saving time and storage space. Moreover, while the
clone initially shares the same set of micro-partitions as the original table, any subsequent
changes to either table are written to new micro-partitions owned by that table, so the original
and the clone can be modified independently.
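A minimal sketch of the cloning syntax, reusing the EMPLOYEE example above (the dev_db/prod_db names are illustrative):
CREATE TABLE EMPLOYEE_CLONE CLONE EMPLOYEE;
-- schemas and databases can be cloned the same way
CREATE DATABASE dev_db CLONE prod_db;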
FAQs
Why is it called zero copy clone?
The term "Zero Copy Clone" is used because Snowflake's cloning process doesn't involve
physical data copying. It creates a reference to the source data, eliminating the need for
duplication and resulting in zero additional storage costs.
How does Snowflake zero copy clone work?
Snowflake zero copy clone works by creating new metadata entries that point to the micro-
partitions of the original source object instead of making a physical copy of the data.
What are the advantages of zero copy cloning Snowflake?
Effective data cloning without physical duplication, saving time.
Storage space and cost savings as it doesn't consume additional storage.
Hassle-free cloning process using the "CLONE" keyword.
Single-source data management with new metadata for each clone.
Maintaining data security and access controls.
What are the limitations of Snowflake zero copy clone?
Resource requirements and potential performance impact.
Longer clone time for tables with a large number of micro-partitions.
Not all object types are supported for cloning.
Which objects are supported in Snowflake Zero Copy Cloning?
Databases
Schemas
Tables
Views
Materialized views
Sequences
Can Snowflake stages be cloned?
Yes, individual external named stages in Snowflake can be cloned. External stages refer to
buckets or containers in external cloud storage. Cloning an external stage does not affect the
referenced cloud storage. However, internal (Snowflake) named stages cannot be cloned.
Can you clone internal named stages?
No, Internal named stages cannot be cloned.
How does Zero Copy Cloning save time and money?
Zero Copy Cloning eliminates the need for creating multiple development environments in
separate accounts, reducing costs and time spent on creating large copies of production tables.
Snowflake Resource Monitors 101: A Comprehensive Guide (2024)
Snowflake Resource Monitors is a powerful feature that helps with Snowflake monitoring and
avoiding unexpected credit usage. It's the only official monitoring tool from Snowflake that lets you
monitor your credit consumption and control your warehouses, making it essential for
Snowflake users. In this article, we'll explain how Resource Monitors work, how to set them up,
how to customize notifications and define actions—and much more!
What is Snowflake Resource Monitors?
Snowflake Resource Monitors is a feature in Snowflake that helps you manage and control
Snowflake costs. It allows you to monitor and set limits on your compute resources, so you can
avoid overspending and potentially going over your budget. By setting up Resource Monitors,
you can define actions to be taken when usage thresholds are exceeded, such as sending alerts or
suspending warehouses. This gives you more control over your Snowflake data warehouse and
helps you stay within your budget.
What Features Do Snowflake Resource Monitors Offer?
Snowflake Resource Monitors is a great feature for anyone using Snowflake. It offers a wealth
of features that can help you better monitor and manage your Snowflake workloads and
resources, giving you more control over your Snowflake environment.
Some of the important features of the Snowflake Resource Monitors are:
Credit usage monitoring: It allows you to monitor your credit usage in real-time, so you
can stay updated on your account's resource consumption.
Multiple levels of control: Snowflake Resource Monitors can be set at the account level
or warehouse level, giving users high flexibility and control over their resources.
Credit quota management: It allows you to set credit quotas for specific warehouses and
even for the entire account to help you stay within your budget and avoid unexpected
Snowflake costs.
Custom notification + alerting: It can send notifications and alerts to users when credit
usage reaches specific thresholds.
Custom Actions: It allows you to configure custom actions that can be triggered when
credit usage is exceeded (such as by suspending a warehouse or sending an alert).
How to setup Snowflake Resource Monitors using Snowflake Web UI?
Setting up Snowflake Resource Monitors is a straightforward process. To make one in
Snowflake, follow the steps outlined below:
Step 1: Log in to Snowflake Web UI
To setup Snowflake Resource Monitors, you need to log in to Snowflake Web UI.
Step 5: Configure the Monitor Type property
Next, let's configure the Monitor Type property. The Snowflake Resource Monitor can monitor
credit usage at the Account and Warehouse levels. If you have selected the Account level, the
Resource Monitors will monitor the credit usage for your entire account (i.e., all warehouses in
the account), and if you select the Monitor Type as Warehouse level, you need to individually
select the warehouses to monitor.
Account level monitor
Warehouse level monitor
Step 6: Configure the Schedule property
Now that you have configured the Monitor Type property, let's configure the Schedule property.
By default, the scheduling of the Resource Monitor is set to begin monitoring immediately and
reset back the credit usage to 0 at the beginning of each calendar month. However, you can
customize the scheduling of the Resource Monitor to your needs.
Time Zone: You have two options to set the schedule's time zone: Local and UTC.
Starts: You can start the Resource Monitor immediately or later. If you choose Later,
you should enter the date and time for the Resource Monitor to start.
Resets: You can choose the frequency interval at which the credit usage resets. The
supported values are Daily, Weekly, Monthly, Yearly, and Never.
Ends: You can run the Resource Monitor continuously by selecting the "never" reset
option, or you can set it to stop at a certain date and time.
Customize Resource Monitor schedule
Step 7: Configure the Actions property
Now that you have set up the basic properties of your Resource Monitor, it's time to define
the Actions property that will be taken when the credit usage threshold gets exceeded. As you
can see in the screenshot below, in the Actions property, there are three types of actions that
Resource Monitors provide by default:
Resource Monitors actions configuration
Suspend Immediately: This action suspends all the assigned warehouses immediately,
which cancels any query statements being executed by the warehouses at that time.
Suspend and notify: This action suspends all the assigned warehouses after all statements
being executed by the warehouse(s) have been completed and then sends an alert
notification to the users.
Notify: This action sends an alert notification to all users with notifications enabled.
Note: A Resource Monitor can have one Suspend and Notify action, one Suspend
Immediately action, and up to five custom Notify actions. A Resource Monitor must have at
least one action defined. If no actions have been defined, you won't be able to create your
Resource Monitor.
Step 8: Create the Resource Monitor
As you can see in the screenshot below, we have successfully configured the Resource Monitor.
So, once you've configured it, click the "Create Resource Monitor" button to create it. As you
can see, we created a Snowflake Resource Monitor with the name "ResourceMonitor_DEMO," a
Credit Quota of "200" at the warehouse level, one warehouse configured with the default
schedule, Suspend Immediately and Notify action set to trigger at 80% of credit usage, Suspend
and Notify action set to trigger at 75% of credit usage, and Finally Notify action set to trigger
at 50% & 60% of credit usage.
Resource Monitors configuration
FAQs
How does Snowflake Resource Monitors help control Snowflake costs?
Snowflake Resource Monitors helps you manage and control Snowflake costs. It allows you to
monitor and set limits on your compute resources, so you can avoid overspending and
potentially going over your budget.
How to setup resource monitor in Snowflake using Snowflake Web UI?
To set up Snowflake Resource Monitors using Snowflake Web UI, log in, navigate to the
Resource Monitors page under the Admin tab, and click on the "+Resource Monitors" button.
What are the steps to configure a Resource Monitor's credit quota and schedule?
To configure a Resource Monitor's credit quota and schedule, specify the desired values in the
New Resource Monitor window, including credit quota, monitor type (account or warehouse
level), and schedule properties.
Can I define custom actions for Snowflake Resource Monitors?
Yes, Snowflake Resource Monitors allow you to define custom actions to be taken when credit
usage thresholds are exceeded, such as suspending warehouses or sending alerts.
Can I create Snowflake Resource Monitors using SQL queries?
Yes, Snowflake Resource Monitors can be created using SQL queries, using this simple query:
CREATE Resource Monitors "ResourceMonitor_DEMO" WITH CREDIT_QUOTA = 300
TRIGGERS
ON 80 PERCENT DO SUSPEND
ON 75 PERCENT DO SUSPEND_IMMEDIATE
ON 60 PERCENT DO NOTIFY
ON 50 PERCENT DO NOTIFY;
Can Resource Monitors be set at both the account and warehouse levels?
Yes, Resource Monitors can be set at both the account and warehouse levels, providing
flexibility and control over resource management.
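As a quick sketch (the warehouse name is an assumption), a monitor is attached at either level like this:
ALTER ACCOUNT SET RESOURCE_MONITOR = "ResourceMonitor_DEMO";
ALTER WAREHOUSE my_wh SET RESOURCE_MONITOR = "ResourceMonitor_DEMO";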
Which schema has the resource monitor view in Snowflake?
The RESOURCE_MONITORS Account Usage view displays the resource monitors that have been
created in the reader accounts managed by the account. This view is only available in the
READER_ACCOUNT_USAGE schema.
Removing all the occurrences of a specified substring by using an empty string as a replacement
argument - Snowflake REPLACE
As you can see, Snowflake REPLACE replaces the substring “with Chaos Genius” in the
subject “Slash your Snowflake spend with Chaos Genius” with an empty string, and returns
the new string “Slash your Snowflake spend”.
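Since the original screenshot isn't reproduced here, a minimal sketch of that statement (the alias is illustrative):
SELECT REPLACE('Slash your Snowflake spend with Chaos Genius', ' with Chaos Genius', '') AS new_subject;
-- returns: Slash your Snowflake spend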
Now, in the next section, we will compare the Snowflake REPLACE() function with another
similar function, REGEXP_REPLACE().
What Is the Difference Between Snowflake REPLACE and REGEXP_REPLACE?
Snowflake REPLACE() is not the only way to replace substrings in a string value. There is
another similar function called REGEXP_REPLACE that also allows you to perform
replacement operations but with some additional features and flexibility.
Syntax for REGEXP_REPLACE() is straightforward:
REGEXP_REPLACE( <subject> , <pattern> [ , <replacement> , <position> ,
<occurrence> , <parameters> ] )
Snowflake REGEXP_REPLACE function takes 6 (4 of ‘em are optional) arguments:
<subject>: String value to be searched and modified.
<pattern>: Regular expression to be searched for and replaced in the subject.
<replacement>: Substring to replace the matched pattern in the subject. If an empty
string is specified, the function removes all matched patterns and returns the resulting
string.
<position>: Number of characters from the beginning of the string where the function
starts searching for matches.
<occurrence>: Number of occurrences of the pattern to be replaced in the subject. If 0 is
specified, all occurrences are replaced.
<parameters>: A string of characters that specify the behavior of the function. This is
an optional argument that supports one or more of the following characters:
c: Enables case-sensitive matching.
i: Enables case-insensitive matching.
m: Enables multi-line mode. By default, multi-line mode is disabled.
e: Extracts sub-matches.
s: Allows the POSIX wildcard character . to match \n. By default, wildcard character
matching is disabled.
For more details, see regular expression parameters
Now, let's dive into the main difference between Snowflake REPLACE and
REGEXP_REPLACE. The primary difference is that Snowflake REGEXP_REPLACE uses
regular expressions to match the pattern, while Snowflake REPLACE uses literal strings. Here is
a table that quickly summarizes the differences between the Snowflake REPLACE and
REGEXP_REPLACE functions.
Snowflake REPLACE() replaces all occurrences of a specified substring, and optionally replaces
them with another string. REGEXP_REPLACE() returns the subject with the specified pattern
either removed or replaced by a replacement string.
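A small sketch contrasting the two (the input values are illustrative):
SELECT REPLACE('id-123-456', '-', ':');              -- literal match, returns 'id:123:456'
SELECT REGEXP_REPLACE('id-123-456', '[0-9]+', 'N');  -- pattern match, returns 'id-N-N'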
Handling NULL values with Snowflake REPLACE
SELECT REPLACE('string', NULL, 'new_string');
Snowflake REPLACE example
Creating gadgets table and inserting some dummy data into it - Snowflake REPLACE
Here is what our gadgets table looks like:
Selecting all the records of the Snowflake gadgets table - Snowflake REPLACE
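Since the screenshots aren't reproduced here, a minimal sketch of what such a gadgets table might look like (the columns and values are assumptions based on the example below):
CREATE OR REPLACE TABLE gadgets (
id INT,
name STRING,
description STRING,
price NUMBER(10,2)
);
INSERT INTO gadgets (id, name, description, price) VALUES
(1, 'IPhone 15', 'Latest Apple smartphone', 999.00),
(2, 'Galaxy S24', 'Latest Samsung smartphone', 899.00);
SELECT * FROM gadgets;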
Example 1—Basic Usage of Snowflake REPLACE Function
Now, we will use the Snowflake REPLACE function to replace substrings in a string. Suppose
we want to change the name of the product “IPhone 15 to “Apple IPhone 19” in the name
column.
We can use the following query:
SELECT id, REPLACE(name, 'IPhone 15', 'Apple IPhone 19') AS name,
description, price
FROM gadgets
WHERE id = 1;
Snowflake REPLACE example
Creating gadgets_order table and inserting some dummy data into it - Snowflake REPLACE
Here is what our gadgets_order table looks like:
Snowflake REPLACE() works with strings of any length, while TRANSLATE() performs
single-character substitutions.
TLDR; TRANSLATE() is best for fast bulk character substitutions, while Snowflake
REPLACE() enables more advanced string manipulation with greater flexibility.
When to Use Snowflake REPLACE Function?
Finally, in this section, we will discuss when to use the Snowflake REPLACE function and
explore its benefits and limitations. Here are common use cases where Snowflake REPLACE()
can be useful for:
1) Removing or Replacing Substrings
Use Snowflake REPLACE to eliminate unwanted characters, words, or phrases from a string by
substituting them with an empty string (''). Also, it helps in replacing existing substrings and
altering names, formats, or even styles within a string.
2) Data Cleansing and Transformation
You can use Snowflake REPLACE to correct spelling errors, typos, or inconsistencies within
your data. Standardize data formats like dates, numbers, or currencies, thereby enhancing data
quality and accuracy.
3) Dynamic String Manipulation
Use Snowflake REPLACE for string operations based on expressions or variables. For instance,
concatenate strings, split or extract substrings, and generate new strings based on conditions or
logic.
4) Simple String Alterations
You can implement Snowflake REPLACE for straightforward changes to a string, such as
adding or removing prefixes or suffixes, or swapping one fixed piece of text for another. This
function makes simple string modifications easy.
5) Control Over Replacement Logic
You can combine the REPLACE function with other expressions and functions to tailor the
replacement operation to your needs, for example normalizing case before replacing or
restricting replacements to specific rows with a WHERE clause. Because the search and
replacement values can be multi-character strings or expressions, REPLACE gives you more
flexibility and customization options than other similar functions (like TRANSLATE()).
Conclusion
And that’s a wrap! Snowflake REPLACE() is a powerful function that allows you to replace all
occurrences of a specified substring in a string value with another substring. Snowflake
REPLACE() can help you perform various data manipulation and transformation tasks, such as
cleansing, standardization, extraction—and analysis. As we saw in the examples above,
Snowflake REPLACE() can be used for simple tasks like correcting typos as well as more
complex tasks like dynamic string manipulation. But, you should always be aware of how
Snowflake REPLACE() handles the null values and the case sensitivity of the arguments.
In this article, we covered:
What Is Snowflake REPLACE() Function?
How Does a Snowflake REPLACE() Function Work?
What Is the Difference Between Snowflake REPLACE and REGEXP_REPLACE?
How does Snowflake REPLACE Function Handle Null Values?
Practical Examples of Snowflake REPLACE Function
What Is the Difference Between TRANSLATE() and Snowflake REPLACE()?
When to Use Snowflake REPLACE Function?
…and so much more!
By now, you should be able to use Snowflake REPLACE to manipulate and transform your
string data effectively. It's simple to use, yet customizable for diverse needs, helping you to
effectively cleanse, transform—and standardize your string/text data.
FAQs
What is the Snowflake REPLACE() function?
Snowflake REPLACE() is a string function that finds and replaces a specified substring with a
new substring in a string value. It replaces all occurrences of the specified substring.
Does Snowflake REPLACE() replace all occurrences or just the first one?
Snowflake REPLACE() will replace all occurrences of the specified substring, not just the first
match.
Is Snowflake REPLACE() case-sensitive?
Yes, Snowflake REPLACE() performs case-sensitive matches by default.
How does Snowflake REPLACE() handle NULL values?
If any Snowflake REPLACE() argument is NULL, it returns NULL without performing any
replace.
Can Snowflake REPLACE() insert new characters?
Yes, the replacement string can contain new characters not originally present.
Can Snowflake REPLACE() be used to remove substrings?
Yes, you can remove substrings by replacing them with an empty string.
When would Snowflake REPLACE() be useful for data cleansing?
Snowflake REPLACE can correct invalid data entries, standardize formats, and fix
typos/inconsistencies to improve data quality.
Can I use column values or expressions as arguments in Snowflake REPLACE()?
Yes, you can use column names or expressions that evaluate to a string instead of just literal
strings.
Is there a limit on the string length supported by Snowflake REPLACE()?
The maximum string length is 16MB (16777216 characters) which is the Snowflake
STRING/VARCHAR limit.
How is Snowflake TRANSLATE() different from Snowflake REPLACE()?
TRANSLATE does single-character substitutions while Snowflake REPLACE works on entire
strings.
Can I use Snowflake REPLACE() to concatenate or split strings?
Yes, Snowflake REPLACE can be used alongside other string functions like CONCAT or SPLIT
for such operations.
When should I avoid using Snowflake REPLACE()?
Avoid Snowflake REPLACE if you need very high performance—use TRANSLATE instead.
Also if you need more advanced regex patterns use the REGEXP_REPLACE function.
Is Snowflake REPLACE() case-sensitive by default?
Yes, Snowflake REPLACE performs case-sensitive literal substring matching by default.
Can I make Snowflake REPLACE() case-insensitive?
Yes, indirectly: convert both the subject and the search string to the same case (e.g. with UPPER or LOWER) before applying REPLACE, or use REGEXP_REPLACE with the 'i' parameter.
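A hedged sketch of both approaches, reusing the gadgets table from earlier:
-- Normalize case before replacing (note: this lowercases the rest of the string too)
SELECT REPLACE(LOWER(name), 'iphone 15', 'Apple IPhone 19') AS name FROM gadgets;

-- Or use REGEXP_REPLACE with the case-insensitive parameter
SELECT REGEXP_REPLACE(name, 'iphone 15', 'Apple IPhone 19', 1, 0, 'i') AS name FROM gadgets;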
Does Snowflake REPLACE() support regex or wildcards?
No, Snowflake REPLACE does not support regex or wildcards, only literal substring matching.
Can Snowflake REPLACE() insert strings that don't exist in the original?
Yes, the replacement string can contain new characters not originally present.
What data types does Snowflake REPLACE() support?
Snowflake REPLACE works on STRING, VARCHAR, CHAR, TEXT, and similar
string/character data types.
What Is the Difference Between Snowflake IFF and Snowflake CASE?
IFF is another Snowflake conditional expression function similar to Snowflake CASE. The key
differences are:
Snowflake CASE: handles complex conditional logic with multiple WHEN branches.
Syntax: CASE WHEN <condition1> THEN <result1> [ WHEN <condition2> THEN <result2> ] [ ... ] [ ELSE <result3> ] END
Snowflake IFF: a single-level if-then-else expression.
Syntax: IFF( <condition> , <expr1> , <expr2> )
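For a single condition, the two forms are interchangeable. For example (the price column is assumed from the earlier gadgets table):
SELECT IFF(price > 500, 'premium', 'budget') AS tier FROM gadgets;
SELECT CASE WHEN price > 500 THEN 'premium' ELSE 'budget' END AS tier FROM gadgets;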
FAQs
What is a Snowflake CASE statement?
Snowflake CASE is a conditional expression in Snowflake that allows you to perform different
computations based on certain conditions.
How does CASE work in Snowflake?
Snowflake CASE evaluates conditions in sequence and returns the result of the first matching
condition. An optional ELSE clause specifies a default result.
Can I use Snowflake CASE in a SELECT query in Snowflake?
Yes, Snowflake CASE can be used in SELECT, INSERT, UPDATE and other statements
anywhere an expression is valid.
How do I check for NULL values in a CASE statement?
You can use IS NULL or IS NOT NULL to explicitly check for nulls, as nulls do not match
other nulls in CASE.
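A short illustration (the discount column is assumed for this example):
SELECT CASE
         WHEN discount IS NULL THEN 'no discount'
         WHEN discount > 0.2 THEN 'big discount'
         ELSE 'small discount'
       END AS discount_bucket
FROM gadgets;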
Can I nest Snowflake CASE statements in Snowflake?
Yes, you can nest Snowflake CASE statements to create multi-level conditional logic.
What is an alternative to CASE in Snowflake?
DECODE is an alternative that compares an expression to a list of values to return a match.
When should I use Snowflake CASE instead of DECODE in Snowflake?
Snowflake CASE provides more flexibility for complex logic with multiple conditions.
DECODE is simpler for basic value matching.
What is the difference between CASE and IFF in Snowflake?
CASE evaluates its conditions in order and returns the result for the first one that matches, and
it can check multiple conditions; IFF evaluates a single condition and returns one of two
expressions.
How can I handle errors with Snowflake CASE?
Handle unexpected or invalid values explicitly in your Snowflake CASE conditions (for
example, with an ELSE branch), and return appropriate default values or messages.
Can Snowflake CASE improve performance?
Snowflake CASE can improve readability over nested IFs. However, joining may be faster than
complex Snowflake CASEs.
What are some common uses of CASE in Snowflake?
Data transformations, conditional aggregation, pivoting, error handling, and business logic.
What data types can I use with Snowflake CASE?
Snowflake CASE results and return values can be any Snowflake data type.
Does the Snowflake CASE statement support standard SQL?
Yes, CASE conditional expressions are part of the ANSI SQL standard.
Are there any limitations with Snowflake CASE?
No major limitations; just watch out for performance with overly complex logic.
Can I use subqueries in a Snowflake CASE?
Yes, Snowflake supports using subqueries in CASE conditions and return values.
Advanced Snowflake Interview Questions and Answers:
Going beyond the basics, these advanced Snowflake interview questions probe deeper into
your understanding and expertise:
Performance Optimization:
How would you optimize a slow-running query in Snowflake? Analyze
the query profile, identify bottlenecks (e.g., expensive joins, exploding row counts, spilling), and
consider: Clustering: define clustering keys on frequently filtered or joined columns.
Materialized views: pre-compute frequently used query results for faster access. Warehouse
sizing: right-size or scale the warehouse for the workload. Search optimization: since Snowflake
does not use traditional indexes, enable the search optimization service for highly selective
point-lookup queries.
Explain Snowflake's various caching mechanisms and how they impact
performance. Discuss result cache, local disk cache, and remote disk cache.
Explain how they store frequently accessed data for faster retrieval and
reduce query execution time.
How would you monitor and troubleshoot performance issues in
Snowflake? Utilize Snowflake monitoring tools like Query History, the Query Profile,
warehouse load charts, and the ACCOUNT_USAGE views to identify slow queries, resource
usage, and bottlenecks.
Security & Compliance:
Describe your approach to implementing data security and access
control in Snowflake. Discuss strategies for user authentication,
authorization (roles, privileges), data encryption, and activity monitoring.
Mention relevant security features like Dynamic Data Masking and Row
Level Security.
How would you ensure compliance with data privacy regulations like
GDPR or CCPA in Snowflake? Explain data anonymization and
pseudonymization techniques. Discuss how Snowflake's features like Data
Masking and Secure Data Sharing can help comply with regulations.
How would you handle a potential data breach in Snowflake? Describe
your incident response plan, including data loss assessment, notification
procedures, and remediation steps. Mention Snowflake's security features
like Security Alerts and User Activity Logs that can aid in such situations.
Advanced Features & Use Cases:
Explain the concept of Snowflake Streams and how you would use
them for real-time data processing. Discuss how a Stream tracks change data (inserts, updates,
and deletes) on a table and how, combined with Tasks, it enables near-real-time pipelines for
applications like anomaly detection and fraud prevention.
Describe Snowflake's capabilities for machine learning and data
science. Explain Snowflake's integration with external ML tools and its
native features like User Defined Functions (UDFs) for building and
deploying ML models.
How would you design a Snowflake data pipeline for a complex data
processing scenario? Discuss your approach, including data sources,
ETL/ELT tools, data transformation steps, and Snowflake integration.
Showcase your understanding of data pipelines and their role in modern
data architectures.
Remember:
Go beyond basic definitions and provide detailed explanations with
practical examples.
Showcase your problem-solving skills by explaining your thought
process and approach.
Demonstrate your understanding of the latest Snowflake features and
functionalities.
Be confident and articulate when expressing your knowledge and
expertise.
By preparing for these advanced questions and showcasing your in-depth understanding of
Snowflake, you can impress interviewers and solidify your position as a sought-after
Snowflake professional.
1. What are the different types of Stages?
Stages are storage locations used to hold data files before they are loaded into (or after they are
unloaded from) Snowflake tables.
In Snowflake, there are two types of stages:
1. Internal stage — Resides in the Snowflake storage
2. External stage — Resides in any of the cloud object storage (AWS S3, Azure
Blob, GCP bucket )
Data is loaded from a stage into a table (or unloaded from a table into a stage) using the COPY
INTO command.
For bulk loading you can use COPY INTO, and for continuous data loading you can use
Snowpipe, a serverless service provided by Snowflake.
To upload files from the local file system to an internal stage, you can use
the PUT command.
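A rough sketch of the bulk-loading flow (stage, table, and file names are hypothetical; PUT runs from a client such as SnowSQL and compresses the file with gzip by default):
PUT file:///tmp/orders.csv @my_internal_stage;

COPY INTO orders
  FROM @my_internal_stage/orders.csv.gz
  FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);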
2. What is Unique about Snowflake Cloud Data Warehouse?
Snowflake has introduced many unique features that are not available in most other data
warehouses currently in the market.
1. Totally cloud-agnostic (SaaS)-
Snowflake runs on any of the three major cloud service providers (AWS, Azure, GCP) for its
underlying infrastructure. It provides true SaaS functionality: the user does not need to
download or install any software to use Snowflake, or worry about managing any hardware.
2. Decoupled storage and compute-
Storage and compute are separate layers that scale independently and work together through
the cloud services layer. This helps decrease usage costs, since the user pays only for what they
use.
3. Zero copy cloning-
This feature takes a snapshot of a table at the current instant, for example to create a backup.
The clone does not consume any additional physical storage unless changes are made to the
cloned object, because it initially references the same micro-partitions as the source table. Once
changes are made to the cloned object, the changed data is written to new micro-partitions.
4. Secure data sharing-
This feature provides secure sharing of data with other Snowflake accounts, or with users
outside of Snowflake. By secure, it means you grant authorized consumers access to specific
objects (e.g. particular tables) while the rest of the account remains hidden from them. Shared
objects are always read-only. You can create a Reader account to share data with users who do
not have a Snowflake account.
5. Supports semi-structured data-
Snowflake supports file formats such as JSON, AVRO, ORC, PARQUET, and XML. The
VARIANT data type is used to load semi-structured data into Snowflake; once loaded, it can be
flattened into multiple columns in a table. A VARIANT value has a 16MB limit. The FLATTEN
function is used to split nested attributes into separate rows and columns.
6. Scalability-
Since Snowflake is built on cloud infrastructure, it uses cloud services for storage and compute.
A warehouse is a compute cluster used to execute queries. Users can scale compute up when
large amounts of data need to be loaded or queried faster, and scale back down when the work
is finished, without any interruption to service.
7. Time-travel and Failsafe-
Time Travel lets you retrieve Snowflake data and objects that have been changed or dropped.
You can read or restore data that was deleted within a permissible retention period using Time
Travel.
CDP lifecycle
Using Time Travel, you can perform the following actions within a defined period of time:
1. Query data in the past that has since been updated or deleted.
2. Create clones of entire tables, schemas, and databases at or before specific points in the
past.
3. Restore tables, schemas, and databases that have been dropped.
3. What are the different ways to access the Snowflake Cloud Data warehouse?
Snowflake provides a web UI (Snowsight) to access Snowflake, as well as SnowSQL, a
command-line client, to execute SQL queries and perform all DDL and DML operations,
including data loading and unloading. It also provides native connectors and drivers for Python,
Spark, Go, Node.js, JDBC, and ODBC.
4. What are the data security features in Snowflake?
Snowflake provides below security features:
1. Data encryption
2. Object-level access
3. RBAC
4. Secure data sharing
5. Masking policies for sensitive data
5. What are the benefits of Snowflake Compression?
When files are staged, Snowflake compresses them with gzip by default, which reduces the
storage space they occupy and improves data loading and unloading performance. Snowflake
also auto-detects already-compressed file formats such as gzip, bzip2, deflate, and raw_deflate.
6. What is Snowflake Caching? What are the different types of caching in
Snowflake?
It comprises three types of caching :
1. Result cache- This holds the results of every query executed in the past 24 hours.
2. Local disk cache- This is used to cache data used by SQL queries. Whenever data is
needed for a given query it’s retrieved from the Remote Disk storage, and cached in SSD
and memory.
3. Remote disk (storage layer)- This holds the long-term storage. This layer is responsible for
data resilience, which in the case of Amazon Web Services means 99.999999999% durability,
even in the event of an entire data center failure.
Snowflake cache
7. Is there a cost associated with Time Travel in Snowflake?
Yes. Data retained for Time Travel consumes storage, which is billed as part of your storage
costs. Time Travel is the feature provided by Snowflake to retrieve data that has been changed
or removed from Snowflake databases.
Using Time Travel you can:
1. Query data in the past that has since been updated or deleted.
2. Create clones of entire tables, schemas, and databases at or before specific points in the
past.
3. Restore tables, schemas, and databases that have been dropped.
Once the Time travel period is over, data is moved to the Fail-safe zone.
For Snowflake Standard Edition, the default (and maximum) Time Travel period is 1 day.
For Snowflake Enterprise Edition (and above):
for transient and temporary databases, schemas, and tables, the Time Travel period is 0 or 1 day;
for permanent databases, schemas, and tables, the Time Travel period can range from 0 to 90
days (1 day by default).
8. What is fail-safe in Snowflake?
When the Time Travel period elapses, removed data for permanent tables moves into the
Fail-safe zone for 7 days. Once data is in Fail-safe, you need to contact Snowflake Support to
restore it, which may take anywhere from several hours to a few days. Fail-safe data continues
to incur storage charges for as long as it is retained.
9. What is the difference between Time-Travel vs Fail-Safe in Snowflake?
Time Travel has a retention period ranging from 0 to 90 days for permanent databases,
schemas, and tables, whereas Fail-safe is fixed at 7 days.
Once a table or schema is dropped from the account, it remains recoverable via Time Travel for
the retention period configured on that object (0-90 days).
Once Time Travel has elapsed, objects move into the Fail-safe zone.
Snowflake provides us with 3 methods of time travel -
a. Using Timestamp - query or restore data as it existed at, or immediately preceding, a
specified timestamp.
b. Using Offset - query or restore data as it existed a specified number of seconds in the past.
c. Using Query ID - query or restore data as it existed immediately before, or at the completion
of, a specified statement.
For example, we can drop a table and then read its data again:
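A minimal sketch of that flow (the orders table name is assumed):
-- Drop the table
DROP TABLE orders;

-- Restore the dropped table from Time Travel
UNDROP TABLE orders;

-- Read the data again, e.g. as it existed 5 minutes ago
SELECT * FROM orders AT(OFFSET => -300);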
10. How does zero-copy cloning work and what are its advantage
Zero-copy cloning creates a clone of a Snowflake object.
You can create clones of objects such as databases, schemas, tables, streams, stages, file
formats, sequences, and tasks.
When you create a clone, Snowflake points the clone's metadata at the source object's existing
micro-partitions; no data is copied until you make changes to the cloned object.
1. The main advantage is that it creates a copy of the object very quickly.
2. It does not consume any extra storage if no updates happen on the cloned object.
3. It is a fast way to take a backup of any object.
Syntax:
create table orders_clone clone orders;
11. What are Data Shares in Snowflake?
Data sharing is the feature provided by Snowflake to share data across Snowflake accounts
and with people outside of Snowflake. You share customized datasets by granting access to
specific objects. For people outside Snowflake, you need to create a reader account with
read-only access to the data.
Below are the objects that can be shared:
Tables
External tables
Secure views
Secure materialized views
Secure UDFs
There are two types of users:
1. Data provider: The provider creates a share of a database in their account and grants access
to specific objects in the database. The provider can also share data from multiple databases,
as long as these databases belong to the same account.
2. Data consumer: On the consumer side, a read-only database is created from the share.
Access to this database is configurable using the same, standard role-based access control that
Snowflake provides for all objects in the system.
12. What is Horizontal scaling vs Vertical scaling in Snowflake?
Snowflake Enterprise and the above versions support a multi-cluster warehouse where you
can create a multi-cluster environment to handle scalability. The warehouse can be scaled
horizontally or vertically.
The multicluster warehouse can be configured in two ways :
1. Maximized mode: The minimum and maximum number of clusters are set to the same value
(greater than 1, up to 10), so all clusters run whenever the warehouse is running.
Maximized mode
2. Auto-Scale mode: The minimum and maximum number of clusters differ (e.g. min = 2 and
max = 10), and Snowflake automatically starts and stops clusters as the query load changes.
Auto-Scale mode
You can manually change your warehouse according to your query structure and complexity.
Below are the scaling methods available in snowflake.
Vertical scaling :
scaling up: Increasing the size of the warehouse (small to medium)
scaling down: decreasing the size of the warehouse (medium to small)
Horizontal scaling:
Scaling in:
Removing clusters from the warehouse, down to the configured minimum. (4 → 2)
Scaling out:
Adding clusters to the warehouse, up to the configured maximum. (2 → 4)
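A rough sketch of these operations in SQL (the warehouse name is assumed):
-- Vertical scaling: resize the warehouse
ALTER WAREHOUSE my_wh SET WAREHOUSE_SIZE = 'MEDIUM';

-- Horizontal scaling: adjust the multi-cluster range (Enterprise edition and above)
ALTER WAREHOUSE my_wh SET MIN_CLUSTER_COUNT = 2 MAX_CLUSTER_COUNT = 4;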
12. Where is metadata stored in Snowflake?
Once a table is created in Snowflake, Snowflake generates metadata about the table, such as
the row count, the creation timestamp, and column statistics such as MIN and MAX values.
Metadata is stored and managed by Snowflake's cloud services layer, alongside the cloud
storage Snowflake manages; that is why queries that can be answered from metadata alone do
not need a running warehouse.
13. Briefly explain the different data security features that are available in
Snowflake
Multiple data security options are available in snowflake such as :
1. Secure view
2. Reader account
3. Shared data
4. RBAC
14. What are the responsibilities of a storage layer in Snowflake?
The storage layer is nothing but the cloud storage service where data resides.
It has responsibilities such as :
1. Data protection
2. Data durability
3. Data Encryption
4. Archival of Data
15. Is Snowflake an MPP database
Yes. MPP means Massively Parallel Processing. Snowflake is built on the cloud, so it inherits
cloud characteristics such as scalability, and it can run queries in parallel by adding the
necessary compute resources.
Snowflake uses a multi-cluster, shared-data architecture: storage is centralized, while each
virtual warehouse is an MPP compute cluster that processes queries using a shared-nothing
approach. When the query load increases, multi-cluster warehouses can automatically start
additional clusters to handle the concurrency.
16. Explain the different table Types available in Snowflake:
It supports three types of tables :
1. Permanent :
Permanent tables are the default type of table created in Snowflake. They occupy space in
cloud storage, and the data is split into micro-partitions for better data retrieval. Permanent
tables have additional data protection features such as Time Travel and Fail-safe.
The default Time Travel period for a permanent table is 1 day, extendable up to 90 days in
Enterprise Edition and above.
2. Temporary: Unlike permanent tables, temporary tables exist only within the session in which
they were created. The data is purged when the session ends and is not visible to other sessions.
3. Transient: Transient tables are similar to temporary tables with respect to the Time Travel
period (0 or 1 day) and the lack of Fail-safe, but they persist across sessions and will not be
removed until they are explicitly dropped.
17. Explain the differences and similarities between Transient and Temporary
tables
18. Which Snowflake edition should you use if you want to enable time travel for
up to 90 days :
The Standard edition supports a Time Travel period of up to 1 day. For Time Travel of more
than 1 day on permanent tables, you need a Snowflake edition higher than Standard (Enterprise
or above). All Snowflake editions default to one day of Time Travel.
19. What are Micro-partitions :
Snowflake has its own unique way of storing data in cloud storage. Snowflake is a columnar
data warehouse: instead of storing data row-wise, it splits each table into columnar chunks
called micro-partitions. They are called "micro" because each partition holds roughly 50 to
500 MB of uncompressed data.
Snowflake doesn't support indexing; instead, it manages metadata for each micro-partition to
retrieve data faster. A traditional relational database either uses indexes or scans all the rows to
find the requested data, and the overhead of reading unused data makes retrieval
time-consuming and compute-heavy. In contrast, Snowflake uses the micro-partition metadata
to check which chunks contain the data requested by the user. The metadata includes the range
of values for each column and the number of rows in each micro-partition. Using this metadata,
Snowflake prunes micro-partitions during data storage and retrieval.
Check the Snowflake docs on micro-partitions.
20. By default, clustering keys are created for every table, how can you disable
this option
As new data continuously arrives and is loaded into micro-partitions, some columns (for
example, event_date) have near-constant values within each partition (naturally clustered),
while other columns (for example, city) may have the same values appearing over and over
across all partitions.
Snowflake allows you to define clustering keys, one or more columns that are used to co-
locate the data in the table in the same micro-partitions.
To suspend Automatic Clustering for a table, use the ALTER TABLE command with
a SUSPEND RECLUSTER clause. For example:
alter table t1 suspend recluster;
To resume Automatic Clustering for a clustered table, use the ALTER TABLE command with
a RESUME RECLUSTER clause. For example:
alter table t1 resume recluster;
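For reference, defining (and removing) a clustering key looks like this (the column names are assumed):
alter table t1 cluster by (event_date, city);
alter table t1 drop clustering key;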
21. What is the default type of table created in the Snowflake.
In addition to permanent tables, which is the default table type when creating tables,
Snowflake supports defining tables as either temporary or transient. These types of tables are
especially useful for storing data that does not need to be maintained for extended periods of
time (i.e. transitory data).
22. How many servers are present in X-Large Warehouse
23. As Snowflake uses one of the cloud providers (like AWS or Azure) as
part of its architecture, why can't the AWS database Amazon Redshift be
used instead of the Snowflake warehouse?
24. What view types can be created in Snowflake but not in traditional
databases:
Like tables, there are different types of views that can be created in Snowflake,
i.e. normal views, secure views, and materialized views.
Normal views are similar to the views found in RDBMS where the output data depends on the
query it will run on a table or multiple tables. The query needs to be refreshed in order to
reflect the updated data.
Secure Views prevent users from possibly being exposed to data from rows of tables that are
filtered by the view. With secure Views, the view definition and details are only visible to
authorized users (i.e. users who are granted the role that owns the View).
A materialized view is a pre-computed dataset derived from a query specification which is
nothing but a SELECT query in its definition. The output is stored for later use.
Since the underlying data of the given query is pre-computed, querying a materialized view is
faster than executing the original query. This performance difference can be significant when
a query is run frequently or it is too complex.
25. Is Snowflake a Data Lake
A data lake is normally used for dumping all kinds of data coming from various data sources;
it can contain text, chat logs, files, images, or videos. The data is unfiltered, unorganized, and
difficult to analyze, so it is hard to derive insights from it without further processing.
Snowflake, by comparison, supports structured and semi-structured data on scalable cloud
storage, providing data lake features along with analytical usage of the data.
By choosing Snowflake you get the best of both a data lake and a data warehouse.
26. What are the key benefits you have noticed after migrating to Snowflake
from a traditional on-premise database.
1. Cloud agnostic.
2. Decoupled storage and compute.
3. Highly scalable.
4. Query performance.
5. Supports structured and semi-structured data.
6. Native connectors and drivers such as Python, Scala, R, and JDBC/ODBC.
7. Secure data sharing.
8. Materialized views.
27. When you execute a query, how does Snowflake retrieves the data as
compared to the traditional databases.
1. When an end user executes a query, it first goes to the cloud services layer, where it
is parsed, optimized, and restructured for better performance. The query is tuned in
terms of how data will be fetched from the underlying storage, and it is compiled by
the query compiler in the same layer.
2. After compilation, Snowflake checks the result and metadata caches to see whether
any data related to that query is already available before dispatching work to a warehouse.
28. Explain the difference between External Stages and Internal Name Stages:
Stages denote where you want to stage (hold) data in Snowflake.
There are two types of stages in Snowflake:
1. Internal stage:
With an internal stage, Snowflake provides a place to hold the data within itself; the data never
leaves the Snowflake deployment in this kind of stage.
Internal stages are further divided into sub-categories:
1. User: each user automatically gets an allocated stage for data loading.
2. Table: each table automatically gets an allocated stage for data loading.
3. Named: named stages can be created manually for data loading.
2. External stage:
In contrast to internal stages, external stages point to locations outside Snowflake, i.e.
cloud storage buckets (S3, GCS, Azure Blob).
You must specify an internal stage in the PUT command when uploading files to Snowflake.
You must specify the same stage in the COPY INTO <table> command when loading data
into a table from the staged files.
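For reference, a hedged sketch of creating a named internal stage and an external stage (the names, URL, and credentials are placeholders):
-- Named internal stage
CREATE STAGE my_internal_stage;

-- External stage pointing at an S3 bucket
CREATE STAGE my_external_stage
  URL = 's3://my-bucket/path/'
  CREDENTIALS = (AWS_KEY_ID = '<key>' AWS_SECRET_KEY = '<secret>');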
29. Explain the difference between User and Table Stages.
User stages:
Each user has a Snowflake stage allocated to them by default for storing files. This stage is a
convenient option if your files will only be accessed by a single user, but need to be copied into
multiple tables.
User stages have the following characteristics and limitations:
User stages are referenced using @~; e.g. use LIST @~ to list the files in a user
stage.
Unlike named stages, user stages cannot be altered or dropped.
User stages do not support setting file format options. Instead, you must specify
file format and copy options as part of the COPY INTO <table> command.
This option is not appropriate if:
Multiple users require access to the files.
The current user does not have INSERT privileges on the tables the data will be
loaded into.
Table stage:
Each table has a Snowflake stage allocated to it by default for storing files. This stage is a
convenient option if your files need to be accessible to multiple users and only need to be
copied into a single table.
Table stages have the following characteristics and limitations:
Table stages have the same name as the table; e.g. a table named mytable has a
stage referenced as @%mytable.
Unlike named stages, table stages cannot be altered or dropped.
Table stages do not support transforming data while loading it (i.e. using a query
as the source for the COPY command).
Note that a table stage is not a separate database object; rather, it is an implicit stage tied to
the table itself. A table stage has no grantable privileges of its own. To stage files to a table
stage, list the files, query them on the stage, or drop them, you must be the table owner (have
the role with the OWNERSHIP privilege on the table).
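A short sketch of working with user and table stages (the local file path is a placeholder; mytable is the example table name used above):
-- List files in the current user's stage and in a table's stage
LIST @~;
LIST @%mytable;

-- Upload a local file to the table stage, then load it
PUT file:///tmp/data.csv @%mytable;
COPY INTO mytable FROM @%mytable FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);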
30. What are the constraints which are enforced in Snowflake?
Normally, constraints are not enforced in Snowflake, except for NOT NULL constraints, which
are always enforced.
Usually, in traditional databases, there are many constraints being used to validate or restrict
the incorrect data from being stored such as primary key, not null, Unique, etc.
Snowflake provides the following constraint functionality:
Unique, primary, and foreign keys, and NOT NULL columns.
Named constraints.
Single-column and multi-column constraints.
Creation of constraints inline and out-of-line.
Support for creation, modification and deletion of constraints.
31. What is unique about Snowflake Vs Other Warehouses.
Please refer to question Q2.
32. How does Snowflake charge the customer?
Snowflake charges on a pay-as-you-go basis: compute (virtual warehouses and serverless
features) is billed in credits for the time it runs, and storage is billed per terabyte per month.
34. Do DDL commands cost you?
36. Difference between Snowflake and other databases?
37. How will you calculate the expense of a query running in Snowflake?
38. How to load files in Snowflake?
39. How to share a table in snowflake other than the data marketplace?
40. How does Snowflake store data?
41. If I faced an error while loading data what will happen?
42. What is Snowpipe?
43. What is materialized view what are the drawbacks of it
44. How can you implement CDC in Snowflake?
45. What if one of the source tables adds a few more columns - how will you handle it at the
Snowflake end?
46. How to load data from JSON to Snowflake?
47. What are secure views and why they are used? How is data privacy done here?
48. What is materialized view?
49. What are streams?
50. How can you fetch specific data from variant columns?
51. How do you load semi-structured data in Snowflake?
52. How to create a stage in Snowflake?
53. What is clustering
54. What is automatic clustering
55. If I want to fetch data on basis of timestamp value is it feasible to cluster the data on
timestamp?
56. How will you read hierarchical JSON data? For example, if it contains an array, how would
you read that data?
57. How to disable fail-safe.
58. What is the best approach to recover the historical data at the earliest which was
accidentally deleted?
59. You have created a warehouse using the command create or replace warehouse
OriginalWH initially_suspended=true; What will be the size of the warehouse?
Scenario-Based Questions:
1. You have observed that a stored procedure executed daily at 7 AM as part of your batch
process is consuming resources, CPU/IO utilization is at 90%, and the other jobs running at the
same time are impacted by it. How can you quickly resolve the issue with the stored procedure?
2. Some queries are executing on a warehouse and you have run an ALTER WAREHOUSE
statement to resize the warehouse. How will this affect the queries that are already executing?
3. A new business analyst has joined your project. As part of the onboarding process you sent
him some queries to generate reports; the query took around 5 minutes to execute, but the same
query, when executed by other business analysts, returned results immediately. What could be
the issue?
Data Build Tool (DBT) Interview Questions and Answers
nishad patkar · Dec 21, 2023
Data Build Tool (DBT) is a popular open-source tool used in the data analytics and data
engineering fields. DBT helps data professionals transform, model, and prepare data for
analysis. If you’re preparing for an interview related to DBT, it’s important to be well-versed
in its concepts and functionalities. To help you prepare, here’s a list of common interview
questions and answers about DBT.
1. What is DBT?
Answer: DBT, short for Data Build Tool, is an open-source data transformation and modeling
tool. It helps analysts and data engineers manage the transformation and preparation of data
for analytics and reporting.
2. What are the primary use cases of DBT?
Answer: DBT is primarily used for data transformation, modeling, and preparing data for
analysis and reporting. It is commonly used in data warehouses to create and maintain data
pipelines.
3. How does DBT differ from traditional ETL tools?
Answer: Unlike traditional ETL tools, DBT focuses on transforming and modeling data within
the data warehouse itself, making it more suitable for ELT (Extract, Load, Transform)
workflows. DBT leverages the power and scalability of modern data warehouses and allows
for version control and testing of data models.
4. What is a DBT model?
Answer: A DBT model is a SQL file that defines a transformation or a table within the data
warehouse. Models can be simple SQL queries or complex transformations that create derived
datasets.
5. Explain the difference between source and model in DBT.
Answer: A source in DBT refers to the raw or untransformed data that is ingested into the
data warehouse. Models are the transformed and structured datasets created using DBT to
support analytics.
6. What is a DBT project?
Answer: A DBT project is a directory containing all the files and configurations necessary to
define data models, tests, and documentation. It is the primary unit of organization for DBT.
7. What is a DAG in the context of DBT?
Answer: DAG stands for Directed Acyclic Graph, and in the context of DBT, it represents the
dependencies between models. DBT uses a DAG to determine the order in which models are
built.
8. How do you write a DBT model to transform data?
Answer: To write a DBT model, you create a `.sql` file in the appropriate project directory,
defining the SQL transformation necessary to generate the target dataset.
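As an illustration, a minimal model might be a file like models/stg_orders.sql (the file, schema, and column names below are assumptions, not from the article):
-- models/stg_orders.sql
SELECT
    order_id,
    customer_id,
    order_date
FROM raw.orders        -- placeholder source table
WHERE order_date IS NOT NULL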
9. What are DBT macros, and how are they useful in transformations?
Answer: DBT macros are reusable SQL code snippets that can simplify and standardize
common operations in your DBT models, such as filtering, aggregating, or renaming columns.
10. How can you perform testing and validation of DBT models?
Answer: You can perform testing in DBT by writing custom SQL tests to validate your data
models. These tests can check for data quality, consistency, and other criteria to ensure your
models are correct.
11. Explain the process of deploying DBT models to production.
Answer: Deploying DBT models to production typically involves using DBT Cloud, CI/CD
pipelines, or other orchestration tools. You’ll need to compile and build the models and then
deploy them to your data warehouse environment.
12. How does DBT support version control and collaboration?
Answer: DBT integrates with version control systems like Git, allowing teams to collaborate
on DBT projects and track changes to models over time. It provides a clear history of changes
and enables collaboration in a multi-user environment.
13. What are some common performance optimization techniques for DBT models?
Answer: Performance optimization in DBT can be achieved by using techniques like
materialized views, optimizing SQL queries, and using caching to reduce query execution
times.
14. How do you monitor and troubleshoot issues in DBT?
Answer: DBT provides logs and diagnostics to help monitor and troubleshoot issues. You can
also use data warehouse-specific monitoring tools to identify and address performance
problems.
15. Can DBT work with different data sources and data warehouses?
Answer: Yes, DBT supports integration with a variety of data sources and data warehouses,
including Snowflake, BigQuery, Redshift, and more. It’s adaptable to different cloud and on-
premises environments.
16. How does DBT handle incremental loading of data from source systems?
Answer: DBT can handle incremental loading by using source freshness checks and managing
data updates from source systems. It can be configured to only transform new or changed
data.
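A hedged sketch of an incremental model that only processes new rows (the model, source, and column names are assumptions):
-- models/orders_incremental.sql
{{ config(
    materialized='incremental',
    unique_key='order_id'
) }}

SELECT *
FROM raw.orders        -- placeholder source table
{% if is_incremental() %}
  -- on incremental runs, only pick up rows newer than what is already in the target
  WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
{% endif %}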
17. What security measures does DBT support for data access and transformation?
Answer: DBT supports the security features provided by your data warehouse, such as row-
level security and access control policies. It’s important to implement proper access controls
at the database level.
18. How can you manage sensitive data in DBT models?
Answer: Sensitive data in DBT models should be handled according to your organization’s
data security policies. This can involve encryption, tokenization, or other data protection
measures.
19. Types of Materialization?
Answer: DBT supports several types of materialization, as follows:
1)View (Default):
Purpose: Views are virtual tables that are not materialized. They are essentially saved
queries that are executed at runtime.
Use Case: Useful for simple transformations or when you want to reference a SQL query in
multiple models.
{{ config(
materialized='view'
) }}
SELECT
...
FROM ...
2)Table:
Purpose: Materializes the result of a SQL query as a physical table in your data warehouse.
Use Case: Suitable for intermediate or final tables that you want to persist in your data
warehouse.
{{ config(
    materialized='table'
) }}

SELECT
    ...
FROM ...
(dbt determines the target table from the model's file name, so no INTO clause is needed.)
3)Incremental:
Purpose: Materializes the result of a SQL query as a physical table, but is designed to be
updated incrementally. It’s typically used for incremental data loads.
Use Case: Ideal for situations where you want to update your table with only the new or
changed data since the last run.
{{ config(
materialized='incremental'
) }}
SELECT
...
FROM ...
You can also specify a unique_key so that incremental runs update existing rows instead of
inserting duplicates (unique_key is only honored by incremental and snapshot materializations):
{{ config(
    materialized='incremental',
    unique_key='id'
) }}

SELECT
    ...
FROM ...
5)Snapshot:
Purpose: Materializes a table in a way that retains a version history of the data, allowing you
to query the data as it was at different points in time.
Use Case: Useful for slowly changing dimensions or situations where historical data is
important.
Snapshots are defined in their own files under the snapshots/ directory using a snapshot block
rather than a model config, for example:
{% snapshot my_snapshot_table %}
{{ config(
    target_schema='snapshots',
    unique_key='id',
    strategy='timestamp',
    updated_at='updated_at'
) }}
SELECT
    ...
FROM ...
{% endsnapshot %}
20. How do you define tests on DBT models?
Answer: Generic tests are declared in a YAML properties file alongside your models. The
built-in generic tests are unique, not_null, accepted_values, and relationships (which checks
referential integrity against another model). For example:
version: 2
models:
  - name: my_model
    columns:
      - name: id
        tests:
          - unique
          - not_null
      - name: name
        tests:
          - not_null
      - name: age
        tests:
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['active', 'inactive']
  - name: orders
    columns:
      - name: customer_id
        tests:
          - relationships:
              to: ref('customers')
              field: customer_id
Checks that cannot be expressed as generic tests (for example, "column_name > 0") can be
written as singular tests: SQL files in the tests/ directory that return the rows violating the rule.
21. What is a seed?
Answer: A "seed" is a CSV file in your dbt project that dbt loads into a table in your data
warehouse. Seeds are typically used to store static or reference data that doesn't change often
and doesn't require transformation during the ETL (Extract, Transform, Load) process.
Here are some key points about seeds in dbt:
1. Static Data: Seeds are used for static or reference data that doesn’t change
frequently. Examples include lookup tables, reference data, or any data that
serves as a fixed input for analysis.
2. Initial Data Load: Seeds are often used to load initial data into a data
warehouse or data mart. This data is typically loaded once and then used as a
stable reference for reporting and analysis.
3. CSV Files and dbt seed: In dbt, a seed is a CSV file placed in the project's seeds/
(formerly data/) directory. Running the dbt seed command loads the file into a table in
your data warehouse; options such as the target schema and column types can be
configured in dbt_project.yml or a properties YAML file.
Here's an example of a seed configuration in dbt_project.yml:
seeds:
  my_project:
    my_seed_table:
      +column_types:
        id: integer
        country_code: varchar(2)
22. What are pre-hooks and post-hooks in DBT?
Answer: Hooks are SQL statements that dbt runs immediately before or after building a model.
1) Pre-hooks:
A pre-hook is a SQL command or script that is executed before a dbt model runs. It is
commonly used to set up temporary tables, set session parameters, or grant permissions.
Example of a pre-hook:
-- models/my_model.sql
{{ config(
    pre_hook = "CREATE TEMP TABLE my_temp_table AS SELECT * FROM my_source_table"
) }}

SELECT
    column1,
    column2
FROM
    my_temp_table
2)Post-hooks:
A post-hook is a SQL command or script that is executed after the successful
completion of dbt models.
It allows you to perform cleanup tasks, log information, or execute additional
SQL commands after the models have been successfully executed.
Common use cases for post-hooks include tasks such as updating metadata
tables, logging information about the run, or deleting temporary tables created
during the pre-hook.
Example of a post-hook :
-- models/my_model.sql
SELECT
column1,
column2
FROM
my_source_table
{{ config(
post_hook = "UPDATE metadata_table SET last_run_timestamp =
CURRENT_TIMESTAMP"
) }}
23. What are snapshots?
Answer: “snapshots” refer to a type of dbt model that is used to track changes over time in a
table or view. Snapshots are particularly useful for building historical reporting or analytics,
where you want to analyze how data has changed over different points in time.
Here’s how snapshots work in dbt:
1. Snapshot Tables: A snapshot table is a table that represents a historical state of
another table. For example, if you have a table representing customer
information, a snapshot table could be used to capture changes to that
information over time.
2. Unique Identifiers: To track changes over time, dbt relies on unique identifiers
(primary keys) in the underlying data. These identifiers are used to determine
which rows have changed, and dbt creates new records in the snapshot table
accordingly.
3. Timestamps: Snapshots also use timestamp columns to determine when each
historical version of a record was valid. This allows you to query the data as it
existed at a specific point in time.
4. Configuring Snapshots: In dbt, you configure snapshots in your project by
creating a separate SQL file for each snapshot table. This file defines the base
table or view you’re snapshotting, the primary key, and any other necessary
configurations.
Here’s a simplified example:
-- snapshots/customer_snapshot.sql
{% snapshot customer_snapshot %}

{{ config(
    target_database='analytics',
    target_schema='snapshots',
    unique_key='customer_id',
    strategy='timestamp',
    updated_at='updated_at'   -- assumes the source table has an updated_at column
) }}

SELECT
    customer_id,
    name,
    email,
    address,
    updated_at
FROM
    source.customer

{% endsnapshot %}
24.What is macros?
Answer: macros refer to reusable blocks of SQL code that can be defined and invoked within
dbt models. dbt macros are similar to functions or procedures in other programming
languages, allowing you to encapsulate and reuse SQL logic across multiple queries.
Here’s how dbt macros work:
1. Definition: A macro is defined in a separate file with a .sql extension. It
contains SQL code that can take parameters, making it flexible and reusable.
-- my_macro.sql
{% macro my_macro(parameter1, parameter2) %}
SELECT
column1,
column2
FROM
my_table
WHERE
condition1 = {{ parameter1 }}
AND condition2 = {{ parameter2 }}
{% endmacro %}
2. Invocation: You can then use the macro in your dbt models by referencing it.
-- my_model.sql
{{ my_macro(parameter1=1, parameter2='value') }}
When you run the dbt project, dbt replaces the macro invocation with the actual SQL code
defined in the macro.
3. Parameters: Macros can accept parameters, making them dynamic and reusable for
different scenarios. In the example above, parameter1 and parameter2 are parameters
that can be supplied when invoking the macro.
4. Code Organization: Macros help in organizing and modularizing your SQL code. They
are particularly useful when you have common patterns or calculations that need to be
repeated across multiple models.
-- my_model.sql
{{ my_macro(parameter1=1, parameter2='value') }}
-- another_model.sql
{{ my_macro(parameter1=2, parameter2='another_value') }}
A typical dbt project directory structure looks like this:
my_project/
|-- analysis/
| |-- my_analysis_file.sql
|-- data/
| |-- my_seed_file.csv
|-- macros/
| |-- my_macro_file.sql
|-- models/
| |-- my_model_file.sql
|-- snapshots/
| |-- my_snapshot_file.sql
|-- tests/
| |-- my_test_file.sql
|-- dbt_project.yml
It is important to provide clear and concise responses during job interviews with any
company.
What is a snowflake?
Snowflake is an analytical data warehouse that operates on the cloud and offers software as a
service (SaaS).
Why is Snowflake, not any other warehouse? Or What is the advanced feature of
Snowflake?
Below are some of the features available in Snowflake:
(1) Snowflake's three-layer architecture (cloud services layer, query processing layer, storage
layer), with storage and compute decoupled; (2) auto-scaling; (3) Time Travel; (4) zero-copy
cloning; (5) data sharing; (6) multi-language support; (7) Tasks and Streams; (8) Snowpipe;
(9) Snowpark; etc.
What type of user role is available in Snowflake?
Below are six roles available in Snowflake.
ACCOUNTADMIN - The account administrator can manage all aspects of the account.
ORGADMIN - The organization administrator can manage the organization and the accounts
in the organization.
PUBLIC - The public role is automatically available to every user in an account.
SECURITYADMIN - The security administrator can manage the security aspects of the account.
SYSADMIN - The system administrator can create and manage databases and warehouses.
USERADMIN - The user administrator can create and manage users and roles.
How to validate a file prior to loading it into a target table in Snowflake?
Before loading your data, you can validate that the data in the uploaded files will load
correctly. Execute COPY INTO <table> in validation mode
using the VALIDATION_MODE parameter. The VALIDATION_MODE parameter returns
any errors that it encounters in a file.
You can then modify the data in the file to ensure it loads without error.
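A hedged sketch of validation mode (the table, stage, and file format are placeholders):
COPY INTO my_table
  FROM @my_stage
  FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
  VALIDATION_MODE = 'RETURN_ERRORS';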
What types of tables does Snowflake support?
Snowflake offers three types of tables namely, Temporary, Transient & Permanent. Default is
Permanent:
Temporary tables:
Only exist within the session in which they were created and persist only for the remainder of
the session.
They are not visible to other users or sessions and do not support some standard features
such as cloning.
Once the session ends, data stored in the table is purged completely from the system and,
therefore, is not recoverable, either by the user who created the table or Snowflake.
Transient tables:
Persist until explicitly dropped and are available to all users with the appropriate privileges.
Specifically designed for transitory data that needs to be maintained beyond each session (in
contrast to temporary tables)
Permanent Tables (DEFAULT):
Similar to transient tables; the key difference is that they do have a Fail-safe period, which
provides an additional level of data protection and recovery.
What is an External Table in Snowflake?
Snowflake External Tables provide a unique way of accessing the data from files in external
locations(i.e. S3, Azure, or GCS) without actually moving them into Snowflake. They enable
you to query data stored in files in an external stage as if it were inside a database by storing
the file-level metadata.
What types of Snowflake editions are available?
There are four types of snowflake editions available.
1-Standard Edition
2-Enterprise Edition
3-Business Critical Edition
4-Virtual Private Snowflake (VPS)
What types of stage tables are available in Snowflake?
Snowflake supports two different types of data stages: external stages and internal stages. An
external stage is used to move data from external sources, such as (S3, Azure, or GCS),
buckets, to internal Snowflake tables. On the other hand, an internal stage is used as an
intermediate storage location for data files before they are loaded into a table or after they are
unloaded from a table.
Internal stages are further divided into:
(I) User stage
(II) Table stage
(III) Named stage
What is the data retention time in Snowflake?
The standard retention period is 1 day (24 hours) and is automatically enabled for all
Snowflake accounts: For Snowflake Standard Edition, the retention period can be set to 0 (or
unset back to the default of 1 day) at the account and object level (i.e. databases, schemas, and
tables).
Can you explain the concept of a snowflake three-layer architecture?
Database Storage:
When data is loaded into Snowflake, Snowflake reorganizes that data into its internal
optimized, compressed, columnar format. Snowflake stores this optimized data in cloud
storage. Snowflake manages all aspects of how this data is stored — the organization, file size,
structure, compression, metadata, statistics, and other aspects of data storage are handled by
Snowflake.
Query Processing:
Query execution is performed in the processing layer. Snowflake processes queries using
“virtual warehouses”. Each virtual warehouse is an MPP compute cluster composed of
multiple compute nodes allocated by Snowflake from a cloud provider. Each virtual
warehouse is an independent compute cluster that does not share compute resources with
other virtual warehouses. As a result, each virtual warehouse has no impact on the
performance of other virtual warehouses.
Cloud Services:
The cloud services layer is a collection of services that coordinate activities across Snowflake.
Services managed in this layer include:
1-Authentication
2-Infrastructure management
3-Metadata management
4-Query parsing and optimization
5-Access control
What is a clone in Snowflake or what is a zero-copy clone in Snowflake?
The most powerful feature of Zero Copy Cloning is that the cloned and original objects(Table,
schema, database) are independent of each other, any changes done on either of the objects
do not impact others. Until you make any changes, the cloned object shares the same storage
as the original. This can be quite useful for quickly producing backups that don’t cost anything
extra until the copied object is changed.
Can you explain what time travel means in Snowflake?
Snowflake Time Travel enables accessing historical data (i.e. data that has been changed or
deleted) at any point within a defined period.
Can you explain the concept of fail-safe in Snowflake?
Fail-safe protects historical data in case there is a system failure or any other failure. Fail-safe
allows 7 days in which your historical data can be recovered by Snowflake and it begins after
the Time Travel retention period ends. Snowflake support team handles this issue.
How do you read JSON data from a staged file?
We can read JSON files from the stage layer using LATERAL FLATTEN. FLATTEN is a table
function that takes a VARIANT, OBJECT, or ARRAY column and produces a lateral view (i.e.
an inline view that contains correlations referring to other tables that precede it in the FROM
clause). FLATTEN can be used to convert semi-structured data to a relational representation.
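A hedged sketch of this pattern (the table, column, and key names are assumptions):
SELECT
    t.raw_json:customer.name::STRING AS customer_name,
    f.value:sku::STRING              AS sku
FROM my_json_table t,
     LATERAL FLATTEN(input => t.raw_json:items) f;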
I will update the remaining questions shortly.
Snowflake Cost Optimization Series
Pooja Kelgaonkar · Oct 18, 2023
Hello All! Thanks for reading my earlier Snowflake blog series on Data Governance and
Architecting Data Workloads with Snowflake. If you haven't read the earlier blogs, you can
subscribe to my Medium and read the earlier posts in the series.
I am glad to start a new series on Snowflake cost optimization. This series consists of the topics
below and will help you learn more about the costing model, optimization techniques, etc.
1. Understanding Snowflake Cost
2. Warehouse Cost Optimization
3. Understanding Snowflake Optimization services
4. Implementing Snowflake’s Cost Optimization techniques
5. Implement Cost Monitoring & Alerting using Snowsight
Snowflake’s costing model is very transparent and easy to understand. You will be charged
only for the services being used and only for the time they are up and running. You need to
understand Snowflake’s architecture — three layered architecture and their features to
understand the costing model well. You can refer to my earlier blogs to understand some of
the foundational concepts, features required for the cost series.
Snowflake Architecture
1. Read — https://medium.com/@poojakelgaonkar/snowflake-data-on-cloud-
cf3898fee3d0 to understand Snowflake architecture components and their
features, usage.
2. Read — https://medium.com/snowflake/building-dashboards-using-snowsight-
daacf4bd42a8 to understand Snowsight and dashboarding feature.
3. Read — https://medium.com/@poojakelgaonkar/setting-up-alerting-for-snowflake-data-
platform-8b67863eeb07 to know more about Snowflake’s alerting feature.
This series will help you understand Snowflake's costing model, optimization techniques,
services that can be used to improve performance, the costs of serverless services, and how to
design and implement appropriate cost monitoring and alerting to avoid any unpredicted costs
on the account.
Please follow my blog series on Snowflake topics to understand various aspects of data design,
engineering, data governance, cost optimizations etc.
About Me :
I am one of the Snowflake Data Superheroes 2023 and an SME for the Snowflake SnowPro
Core Certification Program. I am a DWBI and Cloud Architect, currently working as a Senior
Data Architect - GCP, Snowflake. I have been working with various legacy data warehouses,
Bigdata implementations, and cloud platforms/migrations. I am a SnowPro Core certified Data
Architect as well as a Google certified Professional Cloud Architect. You can reach out to me
on LinkedIn if you need any further help with certification, data solutions, and implementations!
Object Tagging with Snowflake
Pooja Kelgaonkar · Published in Snowflake · Aug 18, 2023
Thanks for reading my earlier blog in the Data Governance series. In case you missed reading
the blog, you can refer to it here — https://medium.com/snowflake/snowflake-dynamic-data-
masking-4ef7b53b414e
This blog helps you understand tagging in Snowflake. This also covers tagging details — how
you can create them, use them, assign to the database objects, and track them for usage.
What is Tagging?
Tagging is the process of defining a tag for a database object. The tag can be used to identify
the object and to implement data classification, data protection, compliance, and usage
tracking. Tags can be used in centralized as well as de-centralized implementation approaches.
A centralized implementation follows the Snowflake recommendations for role setup and
access control policies: it follows the organizational hierarchy and flows the rules and policies
down to the database objects as per that hierarchy.
What is Tag?
A tag is a schema-level object that can be defined and assigned to one or more different types
of objects. You can assign a string value to a tag. A tag can be assigned to tables, views, and
columns, as well as warehouses. Snowflake limits the number of tags in an account to 10,000.
You can assign multiple tags to an object. This is an Enterprise Edition feature.
How to create a Tag?
You can create a tag using the CREATE TAG statement.
CREATE TAG cost_center COMMENT = 'cost_center tag';
How to assign a Tag to an object?
You can set tags while creating objects, or assign them later using the ALTER command.
CREATE WAREHOUSE DEV_WH WITH TAG (cost_center = 'DEV');
ALTER WAREHOUSE QA_WH SET TAG cost_center = 'QA';
How does object tagging work? What is the hierarchy of Tags?
Tags follow the Snowflake object hierarchy: if you set a tag at the table level, it is also inherited by the table's columns.
What are the benefits of Tag?
Below are some of the benefits —
1. Ease of use — Define once and apply it to multiple objects.
2. Tag Lineage — Since tags are inherited, applying the tag to objects higher in the
securable objects hierarchy results in the tag being applied to all child objects. For
example, if a tag is set on a table, the tag will be inherited by all columns in that
table.
3. Sensitive data tracking — Tags simplify identifying sensitive data (e.g. PII, Secret)
and bring visibility to Snowflake resource usage.
4. Easy Resource Tracking — With data and metadata in the same system, analysts
can quickly determine which resources consume the most Snowflake credits
based on the tag definition (e.g. cost_center, department).
5. Centralized or De-centralized Data Management —
Tags support different management approaches to facilitate compliance with
internal and external regulatory requirements.
How to DISCOVER Tags?
You can query the ACCOUNT_USAGE.TAGS view to list the tags:
SELECT * FROM SNOWFLAKE.ACCOUNT_USAGE.TAGS ORDER BY TAG_NAME;
What commands can be used for tags?
Tags support the standard DDL commands — CREATE, ALTER, DROP, UNDROP, and SHOW. You can use these to create, alter, drop, or restore tags, as well as to list tag details within a database and schema.
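As a quick illustration of these commands (a sketch; it assumes the cost_center tag from the examples above and the POC_DEV_DB.POC schema):
SHOW TAGS IN SCHEMA POC_DEV_DB.POC;
SELECT SYSTEM$GET_TAG('cost_center', 'DEV_WH', 'warehouse');  -- read the tag value set on the warehouse
ALTER TAG cost_center SET COMMENT = 'updated comment';
DROP TAG IF EXISTS cost_center;
UNDROP TAG cost_center;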
Sample use case to tag PII data —
1. Create a tag to identify sensitive data —
CREATE TAG sensitive_data COMMENT = 'PII data tag';
2. Assign the tag to objects to identify PII data —
USE DATABASE POC_DEV_DB;
USE SCHEMA POC;
ALTER TABLE employee_info MODIFY COLUMN employee_ssn SET TAG sensitive_data = 'Employee data';
Hope this blog helps you understand object tagging. Tags let you label objects so you can track their usage and drive data classification.
9 Snowflake Tables - A briefing (as of 2023)
Somen Swain · Published in Snowflake · 12 min read · Dec 20, 2023
In this blog I will discuss the various Snowflake table types and some of the use cases each of them can solve. After reading this blog, I hope it gives you some insight into each table type, how each is significant for different data workloads, where they can be used, and so on.
Overall, there are 9 different kinds of tables in Snowflake, namely:
1. Dynamic table
2. Directory table
3. Event table
4. External table
5. Hybrid table
6. Iceberg table
7. Permanent table
8. Temporary table
9. Transient table
Let us go through each one of them below:
DYNAMIC TABLES
Dynamic tables are the fundamental units of declarative data transformation pipelines. They substantially reduce the complexity of data engineering in Snowflake and offer an automated, dependable, and economical method of preparing your data for use.
A dynamic table lets you specify a query and have its results materialized. You define the target table as a dynamic table and specify the SQL statement that performs the transformation, saving you the trouble of creating a separate target table and writing code to transform and update the data in that table. The feature was first introduced as "Materialized Tables" at the Snowflake Summit 2022, a name that caused some confusion; it was later renamed Snowflake Dynamic Tables and is now available for all accounts.
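A minimal sketch of the creation syntax (table, warehouse, and lag values are illustrative):
CREATE OR REPLACE DYNAMIC TABLE daily_sales
  TARGET_LAG = '1 hour'
  WAREHOUSE = transform_wh
AS
  SELECT order_date, SUM(amount) AS total_amount
  FROM raw_orders
  GROUP BY order_date;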
EVENT TABLES
The log level for an event table can be set in several ways and to several values: 'TRACE', 'DEBUG', 'INFO', 'WARN', or 'ERROR'. A simple demo of the event table is given below:
-- This use case tracks just the INSERT operation counts, i.e. how many records got inserted.
call DEMO_EVENT_TBL_PROC('A');
select * from DEMO_DB.DEMO_SCHEMA.EVENT_TBL_V1;
select * from DEMO_DB.DEMO_SCHEMA.DEMO_CUSTOMER limit 10;
select * from DEMO_DB.DEMO_SCHEMA.BKP_CUSTOMER;
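A sketch of the setup such a demo relies on, reusing the demo's object names (ALTER ACCOUNT requires the ACCOUNTADMIN role):
CREATE EVENT TABLE DEMO_DB.DEMO_SCHEMA.EVENT_TBL_V1;
ALTER ACCOUNT SET EVENT_TABLE = DEMO_DB.DEMO_SCHEMA.EVENT_TBL_V1;  -- route telemetry to the event table
ALTER PROCEDURE DEMO_EVENT_TBL_PROC(VARCHAR) SET LOG_LEVEL = 'INFO';  -- emit INFO-level log messages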
HYBRID TABLES
Hybrid tables are a new Snowflake table type powering Unistore. A key design principle is that this table type supports all the transactional capabilities needed. Hybrid tables are highly performant, which is a need of any transactional application, and they support fast single-row operations. They work on an entirely new row-based storage engine, unlike other Snowflake tables where data is stored in a columnar way. Currently these are still in private preview (PrPr) within Snowflake, but once they are made generally available they have immense potential to unlock a multitude of OLTP use cases.
-- Create an order table with a foreign key referencing the customer table
CREATE OR REPLACE HYBRID TABLE Orders (
    Orderkey number(38,0) PRIMARY KEY,
    Customerkey number(38,0),
    Orderstatus varchar(20),
    Totalprice number(38,0),
    Orderdate timestamp_ntz,
    Clerk varchar(50),
    CONSTRAINT fk_o_customerkey FOREIGN KEY (Customerkey) REFERENCES Customers(Customerkey),
    INDEX index_o_orderdate (Orderdate)  -- secondary index to accelerate time-based lookups
);
More reads:
https://www.snowflake.com/blog/build-open-data-lakehouse-iceberg-tables/
https://docs.snowflake.com/en/user-guide/tables-iceberg
PERMANENT, TEMPORARY & TRANSIENT TABLES
Lastly, let us look at the most widely used tables, i.e., Permanent, Temporary, and Transient tables. These tables have been around for years and are used by almost everyone who has worked with Snowflake.
What is a Permanent table?
The typical, everyday database tables are the "Permanent Tables". Permanent is Snowflake's default table type, and creating one doesn't require any extra syntax. The data kept in permanent tables takes up space and is added to the storage fees that Snowflake charges you.
In addition, permanent tables have extra features like Fail-safe and Time Travel that aid in data availability and recovery.
-- The syntax for creating a permanent table is given below:
create table student (id number, name varchar(100));
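For comparison, temporary and transient tables differ only in the keyword used at creation time; a quick sketch:
create temporary table student_tmp (id number, name varchar(100));   -- visible only to the current session
create transient table student_trn (id number, name varchar(100));  -- persists across sessions, but no Fail-safe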
The Comparison
The differences between the three table types are outlined in the table below, with special
attention to how they affect fail-safe and time travel:
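In brief, per Snowflake's documentation: permanent tables support Time Travel of up to 90 days on Enterprise Edition and higher (1 day on Standard Edition) plus a 7-day Fail-safe period; transient tables support Time Travel of 0 or 1 day and have no Fail-safe; temporary tables also support Time Travel of 0 or 1 day, have no Fail-safe, and exist only for the duration of the session that created them.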
SUMMARY:
Snowflake offers a number of table types, and this blog was written to give insight into each one of them and what each table does. Using the right kind of table for each scenario we handle will bring the best out of this platform.
Please keep reading my blogs; it only encourages me to post more such content. You can find me on LinkedIn by clicking here and on Medium here.
Happy Learning :)
Awarded "Data Superhero" by Snowflake for the year 2023; click here for more details.
Disclaimer: The views expressed here are mine alone and do not necessarily reflect the view
of my current, former, or future employers.
Cristian Scutaru · 8 min read · Nov 10, 2023
I remember last February, when they announced the availability of Snowflake Scripting.
Oracle had PL/SQL, Microsoft SQL Server had Transact-SQL, it was about time for Snowflake
to come up with their own procedural SQL language. All in the general effort of bringing data
processing closer to where the data is. And making this data processing less verbose and more
intuitive.
It was very exciting for me, as I just had to design and implement — for a large client — a huge
generic Snowflake data pipeline with …heavy stored procedures in JavaScript (the only
available alternative at that time).
However, this is the first time since then that I have had a few hours for a rather deep dive. And, despite the obvious utility of this API, …
…I have strong doubts at this moment that Snowflake Scripting is mature enough for production.
Simple Requirements
My need was for a complex generic query encapsulated in a secure tabular function, to be
exposed through a secure data share. To keep it simple here, let’s illustrate this with a very
simple UNION query, that basically duplicates the rows of any table, when you pass the table
name as argument. Here is my very simple test data:
create or replace database script_test;
use schema script_test.public;
What I want is a simple UDTF to return duplicate rows on ANY table. This is the hard-coded
version for just the CUSTOMERS table:
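A minimal sketch of such a hard-coded SQL UDTF, assuming a simple CUSTOMERS table with ID and NAME columns (the function name and columns are illustrative):
create or replace function dup_customers()
returns table (id number, name varchar)
as
$$
    select id, name from customers
    union all
    select id, name from customers
$$;

select * from table(dup_customers());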
I know, I know, I checked the doc as well, and it clearly says (at the present moment at least),
that you can write stored procs with Snowflake Scripting, but only SQL (with simple
expressions) is allowed for UDFs and UDTFs.
But why are they not capable to detect that I pass a code block here (starting with BEGIN or
DECLARE) and simply tell me something like “Snowflake Scripting cannot be used for
UDFs!”. The messages I got were:
Syntax error: compilation error: (line 54)
syntax error line 2 at position 4 unexpected 'LET'.
syntax error line 3 at position 17 unexpected 'from'.
syntax error line 3 at position 33 unexpected '?'.
syntax error line 5 at position 17 unexpected 'from'.
syntax error line 5 at position 33 unexpected '?'. (line 54)
Stored Proc in Scripting? LANGUAGE not optional!
Let’s write it as a stored proc then, skipping the LANGUAGE header. They clearly say: “Note
that this is optional for stored procedures written with Snowflake Scripting”!
create or replace procedure dup_any_sp(table_name varchar)
returns table()
as
begin
LET c1 CURSOR FOR
select * from identifier(?)
union all
select * from identifier(?);
OPEN c1 USING (:table_name, :table_name);
RETURN TABLE(RESULTSET_FROM_CURSOR(c1));
end;
call dup_any_sp('customers');
-- select * from table(result_scan(last_query_id()));
call dup_any_sp('products');
A CALL in a code block can use the INTO clause to pass the returned result (yes, stored procs
return results, just like functions!) into a local variable. While the result of an outside CALL,
from a SQL worksheet, is only dumped on screen in Snowsight.
Alternative with DECLARE
We could initialize the cursor in a DECLARE block as well:
create or replace procedure dup_any_sp10(table_name varchar)
returns table()
language sql
as
declare
c1 CURSOR FOR
select * from identifier(?)
union all
select * from identifier(?);
begin
OPEN c1 USING (:table_name, :table_name);
RETURN TABLE(RESULTSET_FROM_CURSOR(c1));
end;
call dup_any_sp10('customers');
call dup_any_sp2('customers');
You can skip the RESULTSET declaration and get the equivalent short version (thanks, Tom
Meacham!):
call dup_any_sp22('customers');
It's interesting that you cannot pass the SQL statement directly to RETURN TABLE; you always have to pass it through an intermediate RESULTSET.
But any inline SQL statement — with no assignment at all — will also get executed right away,
as it happens for the SQL-based procedures or functions. However, for Scripting there will be
no result returned, at all.
Alternative with EXECUTE IMMEDIATE
One other way to do it — not necessarily better — is to build the whole SQL statement as a
string.
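A sketch of this variant, using the r1 variable and dup_any_sp3 name referenced below (the exact body is an assumption based on the surrounding remarks):
create or replace procedure dup_any_sp3(table_name varchar)
returns table()
language sql
as
begin
    -- table_name appears in a plain expression, so no colon prefix is needed when building the string;
    -- :stmt appears inside a SQL statement (EXECUTE IMMEDIATE), so it keeps the prefix.
    LET stmt VARCHAR := 'select * from ' || table_name
        || ' union all select * from ' || table_name;
    LET r1 RESULTSET := (EXECUTE IMMEDIATE :stmt);
    RETURN TABLE(r1);
end;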
Remark how we don't need the : notation here. It is only required for local variables or function parameters when you insert them into inline SQL statements. But to me it is rather confusing when we use these variables sometimes with and sometimes without the prefix, especially because we do not always get clear error or warning messages when we make these kinds of mistakes.
call dup_any_sp3('customers');
Try to remove the RESULTSET data type from the LET r1 command, assuming the type will
be inferred. It is not! With a much simpler use case you get a proper warning (“Variable ‘R1’
cannot have its type inferred from initializer”). But in this slightly more complicated use case
we get a very confusing and misplaced “unexpected ‘immediate’” error.
The following EXECUTE IMMEDIATE alternative separates things out and first declares the local variables in a DECLARE block:
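A sketch of this variant, with the variables declared up front and then assigned without LET (the body is an assumption based on the remarks below):
create or replace procedure dup_any_sp31(table_name varchar)
returns table()
language sql
as
declare
    stmt VARCHAR;
    r1 RESULTSET;
begin
    stmt := 'select * from ' || table_name
        || ' union all select * from ' || table_name;
    r1 := (EXECUTE IMMEDIATE :stmt);
    RETURN TABLE(r1);
end;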
call dup_any_sp31('customers');
It works, but what I found confusing — and I wasted some time on it — is yet again the lack of
clarity in the returned error messages, when you make a very honest mistake.
Try to leave the LET keywords as they were, and you get all sorts of error messages, either on
CREATE, or on CALL. But nothing will tell you that you should not use LET with variables
that you already declared before. Or, if you can do it, get at least a warning about it!
I repeated the experiment and I found indeed a simpler use case where they tell me this
(something like “Variable with name X declared twice.”). But this should get better
coverage…
Conclusions
Snowflake should make it very clear that UDFs and UDTFs are NOT supported with Snowflake Scripting at this time. A CREATE FUNCTION should return a much better error message when a statement starting with BEGIN or DECLARE is detected. From the inconsistencies I've seen, I can only assume that support for UDFs with Scripting is coming. But in the meantime, we should not spend hours figuring out what the errors are.
You should either make LANGUAGE SQL required everywhere, or make it optional but reliably detect right away whether what follows is a plain SQL statement or a block of Scripting code. If in doubt, send a clear error message along the lines of "Add a LANGUAGE header, as we are not sure what the language is".
There is frequent confusion about where the : prefix notation is truly required and where it is not, for local script variables and function arguments. And yes, we know that it's required in inline SQL statements (not in literal strings!). But a universal usage of any variable with the : prefix in a script block would make everything more consistent and much easier to follow.
Many error messages for Snowflake Scripting are rather confusing and frequently misplaced. It's very hard to debug and figure out what went wrong even in a very small block with just 10–15 lines of code. If Snowflake Scripting is indeed a very serious extension, as it's supposed to be, the user experience should be improved as well. There are many bugs, or confusing things that are not supposed to happen, as I illustrated with just a few use cases above.
The documentation for Snowflake Scripting is flooded everywhere with the alternative using the required $$ delimiters in SnowSQL and the classic UI. Yes, it is a temporary limitation, but a better way should be found to signal it, instead of repeating it again and again for almost every example.
Felipe Hoffa · Published in Snowflake · 6 min read · Jan 10
Amping up Snowflake’s compute
Traditionally Snowflake has offered 2 easy ways of increasing compute power when dealing
with larger queries and concurrency:
Scale your session’s “virtual warehouse” to a larger size.
Set up “multi-cluster warehouses” that dynamically add more clusters to deal
with peaks of concurrent usage.
With these 2 basic elements, users are able to set up policies to control costs and divide
resources between different teams and workloads.
For example — for my Snowflake experiments I usually do everything on my own “Small
Warehouse”. This keeps costs low, and it’s usually pretty fast and predictable. I only need to
scale to larger warehouses when dealing with huge transformations and extracts, like the
example we are going to play with today.
Scaling a warehouse up to get faster results on slower queries is super easy, barely an
inconvenience. I can jump at any moment from a “Small WH” to an “Extra Large WH”, to a
“Medium WH”, to a “4X-Large WH”, etc. This is cool, but then the question becomes: “How
can I tell exactly what’s the best WH size for my upcoming queries?”
Instead of resizing warehouses, it would be really cool if I could run my whole session on a
“Small WH” (or an “Extra Small WH”) and then I could have Snowflake automatically
intercept my larger queries, and run them with way more resources in a “magic serverless”
way.
And that’s exactly what the new Query Acceleration Service does. Let’s test it out here (with a
Snowflake Enterprise Edition account).
Conversation and live demo featuring Query Acceleration Service
Extracting data from GitHub’s 17 Terabyte Archive
For this experiment we are going to look into GH Archive — a collection of all GitHub events.
I did a lot of experiments with it in my past life at Google, and now Cybersyn has made a copy
of the GH Archive on the Snowflake Marketplace.
To bring this dataset into your Snowflake account, just ask your Account Admin to import it
at no cost:
Importing the GH Archive into your Snowflake account
For more tips, check my post “Querying GitHub Archive with Snowflake: The Essentials”. In
the meantime let’s continue with a straightforward example.
Note in the above screenshot that I renamed the incoming database to GHARCHIVE for
cleaner querying.
Once we have GHARCHIVE in our account, we can see 3 tables — with the main one
being events:
select count(*)
from gharchive.cybersyn.github_events;
-- 1.4s
-- 6,966,010,260
That’s 7 billion rows of rich history — and a lot of data to deal with. The first step when
exploring datasets this large should be to extract a subset of rows with the data we are
interested in:
For example, this is the whole history of Apache Iceberg on GitHub:
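A sketch of what such an extraction could look like (the repo_name column and filter value are assumptions about the events table's schema):
create or replace table iceberg_events as
select *
from gharchive.cybersyn.github_events
where repo_name = 'apache/iceberg';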
This table extraction took only 96 seconds on a “2X Large WH”, but some long ~22 minutes
on my usual “Small WH”.
Can the Query Acceleration Service (QAS) help here? There’s a very easy way to tell:
select system$estimate_query_acceleration('01b191db-0603-f84f-002f-a0030023f256');
And the response is “yes” (when using the query id from the ~22m run):
{
"queryUUID":"01b191db-0603-f84f-002f-a0030023f256",
"status":"eligible",
"originalQueryTime":1191.759,
"estimatedQueryTimes":{"1":608,"2":411,"4":254,"8":149,"31":55},
"upperLimitScaleFactor":31
}
Snowflake is telling us that the query took 1191s, and if we had let the QAS service help, it
could have taken between 608s and 55s — depending on the max scaling factor we would
allow it (in this case, up to 31).
To test QAS, I created a new WH. To make this test more dramatic, I made it an “Extra Small”
with unlimited scaling power:
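A sketch of that warehouse definition (the name is illustrative; a max scale factor of 0 removes the upper bound):
create warehouse xs_qas_wh
  warehouse_size = 'XSMALL'
  enable_query_acceleration = true
  query_acceleration_max_scale_factor = 0;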
If I use the “Extra Small WH with unlimited QAS”, Snowflake now automatically accelerates
this query
To check the cost of this QAS query that ran in 77s while I was working within an "Extra Small WH" session, we can check the logs:
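One way to do that is through the ACCOUNT_USAGE views (a sketch; note these views have some ingestion latency):
select query_id, query_acceleration_bytes_scanned, total_elapsed_time
from snowflake.account_usage.query_history
where query_acceleration_bytes_scanned > 0
order by start_time desc;

select warehouse_name, credits_used, start_time
from snowflake.account_usage.query_acceleration_history
order by start_time desc;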
We can see that the query scanned 5 terabytes of data, for a total cost of 0.4 credits. Depending on the region, with an Enterprise Edition Snowflake account that should be around $1.50.
In comparison:
Small WH, Enterprise edition, 1192s: $3*2*1192/3600 = $1.99 (+ time between queries and auto-suspend)
2XL WH, Enterprise edition, 96s: $3*32*96/3600 = $2.56 (+ time between queries and auto-suspend)
QAS 21x auto-acceleration, within an X-Small session, 77s: $3*0.5 = $1.5 (serverless model, no auto-suspend needed for the QAS queries, but the XS session kept running for $3*1*67/3600 = $0.06 extra)
This is the power of the Query Acceleration Service: When it works, we don’t need to worry
anymore about re-sizing warehouses, and we can let Snowflake take care of the huge queries
that need extra power.
Query Acceleration Caveats
QAS is Generally Available (GA) in Snowflake and ready for you to use.
However, you will notice that it's picky about which queries it decides to accelerate, and I expect this set of supported queries to grow over time.
You can find a handy history of the queries in your account that could have been accelerated:
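For example, via the QUERY_ACCELERATION_ELIGIBLE view (a sketch):
select query_id,
       warehouse_name,
       eligible_query_acceleration_time,
       upper_limit_scale_factor
from snowflake.account_usage.query_acceleration_eligible
order by eligible_query_acceleration_time desc
limit 100;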
The docs also list what kind of queries are not eligible for acceleration (for now):
Some queries are ineligible for query acceleration. The following are
common reasons why a query cannot be accelerated:
For example this query could not get QAS during my tests:
But this one that produces the same results does — thanks to adding group by and order
by:
Next steps
Conversation and live demo featuring Query Acceleration Service
To go deeper analyzing GitHub, check my post “Querying GitHub Archive with
Snowflake: The Essentials”.
Check my conversation and live demo featuring Query Acceleration Service with Product Manager Tim Sander.
Check out how many of your queries could have been accelerated with QAS in
your account.
Report results, and give us feedback on what else you’d like QAS to take care of.
Try the combination of QAS + Search Optimization.
Want more?
The One Billion Row Challenge with Snowflake
Sean Falconer · Published in Snowflake · 9 min read · Jan 9
I recently learned about the One Billion Row Challenge initiated by Gunnar Morling. The
challenge is to create a Java program to process a CSV file containing 1 billion rows, where
each row contains the name of a weather station and a recorded temperature. The program
needs to calculate the minimum, average, and maximum temperatures for each weather
station.
Originally tailored for Java, the challenge has attracted interest in other programming
languages, as well as approaches using databases and SQL. Initially, I considered using this
challenge as an opportunity to delve into Rust or Go, two languages I’ve been eager to try.
However, I was inspired by Robin Moffat and Francesco Tisiot, who successfully tackled the
challenge using SQL with DuckDB, PostgreSQL, and ClickHouse. Motivated by their
approach, I chose to undertake the challenge using Snowflake.
This article details my various experiments and performance results.
Things to Note on Measuring Performance
In the actual challenge, all programs are run on a Hetzner Cloud CCX33 instance with 8
dedicated vCPU, 32 GB RAM. The time program is used to measure the execution times.
With Robin and Francesco’s work, they ran their databases locally. Clearly that’s not possible
with Snowflake as it is only available in the cloud.
Additionally, the compute power for Snowflake is determined by the virtual warehouse size.
You can think of a virtual warehouse a bit like a remote machine that does compute for you. A
larger warehouse is going to give you more query processing power to reduce the overall time,
so this is not going to be an exact apples to apples comparison, but hey, this is just for fun :-).
For the purposes of my testing, I used the Large warehouse option. In my testing, a Large
warehouse was about 5 times faster during a table copy than an X-Small warehouse.
Loading and Processing CSV Files in Snowflake
The first step is to load the CSV file into Snowflake. There are a number of ways to do this, from using Snowsight directly, to SnowSQL, to hosting the data externally to Snowflake in an S3 bucket. For this particular situation, Snowsight is not a viable option, as the file we need to process is 13 GB and Snowsight's file limit is 50 MB.
For the challenge, I tested out two primary approaches: an internal stage and external stage,
along with a few variations for each approach.
A Snowflake stage is used for loading files into Snowflake tables or unloading data from tables
into files. An internal stage stores data within Snowflake, while an external stage references data in a location outside of Snowflake, like an S3 bucket.
In the first approach with an internal stage, I used Snowsql to move the file from my local
machine into a Snowflake stage, then copy the file into a table, and execute the SQL against
the table.
The second approach used an external stage that points to an AWS S3 bucket and an external
table. I kept all data external to Snowflake and avoided loading a native table.
There are pros and cons to both approaches. Let's start by looking at the internal stage approach.
The Internal Stage Approach
After cloning the 1brc repo and generating the dataset, the data needs to be uploaded into an
internal stage. With the data uploaded, we’ll copy the data into a table where we can run a
query to produce the required output. The flow for this process is similar to what’s shown in
the diagram below.
Example flow for moving measurement files into a table and the query operation needed to produce
the challenge output.
To move the measurement data into an internal stage, I used the following commands.
CREATE OR REPLACE FILE FORMAT measurement_format
  TYPE = 'CSV'
  FIELD_DELIMITER = ';';

PUT file:///LOCATION_TO_FILE/1brc/measurements* @measurements_record_stage
  AUTO_COMPRESS=TRUE PARALLEL=20;
AUTO_COMPRESS gzips the CSV files during the upload, and the PARALLEL parameter helps speed up the upload by using multiple threads. The output of the PUT command looks like this:
Next, I used the COPY INTO command to move the stage data into the measurements table.
Doing this with the 13 GB CSV file is quite slow as you can see in the performance table below.
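A sketch of that step (the table definition is an assumption; the stage and file format come from the PUT example above):
create or replace table measurements (location varchar, temperature number(8, 1));

copy into measurements
from @measurements_record_stage
file_format = (format_name = 'measurement_format');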
By keeping all the data in a single file, we miss out on the parallel processing power of a cloud-
based platform like Snowflake. The documentation recommends keeping files under 250 MBs
compressed to optimize for parallel loading.
To address this, I split the measurements file into a collection of smaller files. I experimented
with splitting the file into different size chunks to see how that impacted performance. The
best result was with 10 million records per file and a total of 100 files.
Time to copy files into a table depending on how the files were split up.
Moving the data from the stage into a table is where the bulk of the operational cost is going
to be. Breaking the file up into chunks to parallel process the records during the table copy
process saves a lot of time, but there’s still a significant cost.
In contrast, when handling this challenge in a programming language like Java, you can
process the file and simultaneously compute the required values, storing them in a data
structure tailored for the intended output. Creating a specialized, tailored solution is likely going to be less costly than loading everything into a general-purpose database.
However, in the world outside of this competition, the advantage of a database is that you can
run a myriad of queries against the dataset with very high performance. The flexibility and
efficiency in querying typically outweighs the initial data loading costs.
Querying the Data in Snowflake
The last step is to query the data and produce the results in the format as specified by the One
Billion Row Challenge.
Creating a list of results grouped by location along with the minimum, average, and maximum
values is pretty straightforward.
SELECT location,
MIN(temperature) AS min_temperature,
CAST(AVG(temperature) AS NUMBER(8, 1)) AS mean_temperature,
MAX(temperature) AS max_temperature
FROM measurements
GROUP BY location;
However, the challenge specifies that the data must be printed in alphabetical order in a
single line similar to what’s shown below:
{Abha=5.0/18.0/27.4, Abidjan=15.7/26.0/34.1, Abéché=12.1/29.4/35.6,
Accra=14.7/26.4/33.1, Addis Ababa=2.1/16.0/24.3,
Adelaide=4.1/17.3/29.7, …}
To satisfy this requirement, I created an outer query that concatenates the bracket, slash, and comma characters as needed and flattens the results using the LISTAGG function.
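A sketch of that outer query (the exact formatting expression is an assumption built around LISTAGG):
select '{' ||
       listagg(location || '=' || min_temperature || '/' || mean_temperature || '/' || max_temperature, ', ')
         within group (order by location) ||
       '}' as result
from (
    select location,
           min(temperature) as min_temperature,
           cast(avg(temperature) as number(8, 1)) as mean_temperature,
           max(temperature) as max_temperature
    from measurements
    group by location
);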
With this in place, the average query time with caching disabled is 1.77 seconds.
This gives the following overall timings:
23.03 seconds total
- 21.26 seconds table copy time
- 1.77 seconds query time
If query caching is on, then once the first query is run, subsequent queries are about 10 times
faster, with an average time of 0.182 seconds.
An overall time of 23.03 seconds is decent, but not spectacular. The best Java programs at the
time of this writing are close to 6 seconds.
Avoiding the Table Copy Operation
Since the bulk of the operational cost of the first approach is the table copy, I attempted to see
how I could avoid that.
Snowflake supports querying directly against an internal stage. The actual query syntax is
very similar to querying against a table as shown below.
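A sketch of such a stage query, using positional columns $1 and $2 with the file format defined earlier:
select $1 as location,
       min($2::number(8, 1)) as min_temperature,
       cast(avg($2::number(8, 1)) as number(8, 1)) as mean_temperature,
       max($2::number(8, 1)) as max_temperature
from @measurements_record_stage (file_format => 'measurement_format')
group by $1;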
Snowflake’s documentation explicitly states that querying a stage is intended to only be used
for simple queries and not a replacement for moving data into a table.
I ignored this warning just to see what the performance would be like. Querying the billion
records contained within a single file was super slow. So slow that I killed the query
operation.
However, similar to the optimization I made for the table copy where I split the records into
multiple files, when I use the 100 files with 10 million records each, I am able to reduce the
total time to calculate the minimum, average, and maximum temperatures and print the
result to 12.33 seconds. That’s about 50% faster than the first method!
The External Stage Approach
I also wanted to see how an external stage would perform. My goal with the external stage is
to keep the data from being copied into Snowflake. Under normal circumstances there could
be financial reasons for doing this, but for this test, I was interested in comparing the
performance with the internal stage approach previously discussed.
To create the external stage, I first uploaded the temperature measurement files to an S3
bucket. Similar to the internal stage approach, I broke up the single file into 100 separate
files.
Using this approach, I got an average end to end time of 24.83 seconds, so similar to the
total time of the first approach with an internal stage and table copy operation. With caching
turned on, subsequent queries average 0.232 seconds.
I also tried seeing if I could improve performance with a materialized view. This turned out to
be much worse. The query time against the materialized view was about the same as the
external table, but there’s an additional cost to create the view, which took on average about
40 seconds.
CREATE OR REPLACE MATERIALIZED VIEW measurements_view AS
-- Using GROUP BY
SELECT department, AVG(salary) AS avg_department_salary
FROM employees
GROUP BY department;
Output:
| department | avg_department_salary |
|------------|-----------------------|
| HR | 52500.00 |
| IT | 65000.00 |
The PARTITION BY clause is used with window functions, which are a set of
functions that perform calculations across a specific range of rows related to the
current row within the result set.
PARTITION BY divides the result set into partitions to which the window
function is applied separately. It doesn't group the rows in the same way
as GROUP BY.
The PARTITION BY clause is evaluated as part of window-function processing, which happens late in query execution (after WHERE, GROUP BY, and HAVING).
-- Using PARTITION BY
SELECT employee_id, department, salary,
AVG(salary) OVER (PARTITION BY department) AS
avg_department_salary
FROM employees;
Output:
## Question 6.
Imagine there is a FULL_NAME column in a table which has values like “Elon
Musk“, “Bill Gates“, “Jeff Bezos“ etc. So each full name has a first name, a space
and a last name. Which functions would you use to fetch only the first name
from this FULL_NAME column? Give example.
SELECT
SUBSTR(full_name, 1, POSITION(' ' IN full_name) - 1) as first_name
FROM
your_table_name;
SUBSTR(full_name, 1, POSITION(' ' IN full_name) - 1): This part
of the query uses the SUBSTR function to extract a substring from
the full_name column. The arguments are as follows:
full_name: The source string from which the substring is extracted.
1: The starting position of the substring (in this case, from the beginning of
the full_name).
POSITION(' ' IN full_name) - 1: The length of the substring. It
calculates the position of the space (' ') in the full_name column using
the POSITION function and subtracts 1 to exclude the space itself.
as first_name: This part of the query assigns the extracted substring an alias
"first_name" for the result set.
## Question 7.
How can you convert a text into date format? Consider the given text as "31-01-2021".
In SQL, the TO_DATE function is commonly used to convert a text representation of a date into an actual date. The exact syntax varies across database systems, but for the given text the format mask 'DD-MM-YYYY' is the one to use.
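For example, in Snowflake:
SELECT TO_DATE('31-01-2021', 'DD-MM-YYYY') AS converted_date;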
Here's an example of a SQL query using a CASE expression, with an explanation:
SELECT
employee_name,
salary,
CASE
WHEN salary > 50000 THEN 'High Salary'
WHEN salary > 30000 THEN 'Medium Salary'
ELSE 'Low Salary'
END AS salary_category
FROM
employees;
In this example, the CASE statement is used to categorize employees based on their salary. If
the salary is greater than 50,000, the category is 'High Salary.' If the salary is between 30,000
and 50,000, the category is 'Medium Salary.' Otherwise, the category is 'Low Salary.'
## Question 9.
What is the difference between LEFT, RIGHT, FULL outer join and INNER
join?
Joins
In order to understand this better, let’s consider two tables CONTINENTS and COUNTRIES
as shown below. I shall show sample queries considering these two tables.
Table Name: CONTINENTS
It has data for 6 continents. Please note that the continent "Antarctica" is intentionally omitted from this table.
SELECT
employee_id,
employee_name,
department,
salary,
SUM(salary) OVER (PARTITION BY department ORDER BY salary) AS
running_total_salary
FROM
employees;
Output:
In this example, the running_total_salary column represents the running total salary
within each department, calculated based on the ascending order of salaries. The PARTITION
BY clause is used to partition the result set by the department column. The ORDER
BY clause specifies the order of rows within each partition based on the salary column.
The SUM function is applied as a window function, and it calculates the running total salary
for each row within its department.
Snowflake Performance Challenges & Solutions (Part 1)
Slim Baltagi · 29 min read · Nov 8, 2023
Disclaimer: The opinions in this two-part blog series are entirely mine and do not necessarily reflect the opinions of my employers (past, present, or future). This series is neither a critique nor a
praise of the Snowflake Data Cloud. I am attempting to cover a complex topic, Snowflake
Performance Challenges & Solutions, with the hope of paving the way to open, honest, and
constructive related discussions for the benefit of all.
For your convenience, here is the table of contents so you can quickly jump to the item or
items that interest you the most.
TABLE OF CONTENTS
I. Introduction
II. Snowflake Technology (9)
1. Misperceptions and confusion from some Snowflake marketing messages
2. Inherent Snowflake performance limitations
2.1. Performance was not originally a major focus of Snowflake
2.2. Out-of-the-box limited concurrency
2.3. Heavy reliance on caching
2.4. Data clustering limitations
2.5. Vertical scaling is a manual process
2.6. Initial poor performance of ad-hoc queries
2.7. No exposure of functionality like query optimizer hints
2.8. Rigid virtual warehouse sizing leads to computing waste and overspending
2.9. Missing out on performance optimizations from Open File Formats
2.10. No uniqueness enforcement
2.11. No full separation of storage and compute
2.12. Heavy scans
2.13. Out-of-the-box concurrent write Limits
2.14. Data Latency
2.15. Inefficient natural partitioning as the data grows
2.16. Shared Cloud Services layer performance degradation
2.17. No Snowflake query workload manager
2.18. Significant delay in getting large output Resultset in Snowflake UI
3. Forced spending caused by built-in Snowflake service offerings
4. Not enough performance tuning tools and services
5. Overall performance degradation due to data, users, and query complexity growth
6. Performance limitations due to cloud generic hardware infrastructure
7. Dependency on cloud providers and risk of services downtime
8. Limitations in the Query Profile
9. Lack of documentation of error codes and messages
III. Snowflake Ecosystem (3)
1. Inefficient integration with Snowflake
2. Best practices knowledge required by third-party tools and services
3. Piecemeal information about Snowflake performance optimization and tuning
IV. Snowflake Users (4)
1. Snowflake users need to get familiar with Snowflake’s ways of dealing with
performance
2. Snowflake users take shortcuts for solving performance issues
3. Snowflake users lack the knowledge about performance optimization and tuning
4. Snowflake users face a steep learning curve of Snowflake performance
optimization and tuning
4.1. Long-running queries
4.2. Complex queries
4.3. Inefficient queries
4.4. Storage Spillage
4.5. Concurrent queries
4.6. Outlier queries
4.7. Queued queries
4.8. Blocked queries
4.9. Queries against the metadata layer
4.10. Inefficient Pruning
4.11. Inefficient data model
4.12 Connected clients issues
4.13. Lack of sub-query pruning
4.14. Queries not leveraging results cache
4.15. Queries doing a point lookup
4.16. Queries against very large tables
4.17. Latency issues
4.18. Heavy Scanning
4.19. Slow data load speed
4.20. Slow ingestion of streaming data
I. Introduction
You might not be satisfied with the performance of your Snowflake account and need to
optimize it for one or all of the following reasons:
1. Data & Users Growth: As the underlying data grows or the number of users
querying the data increases, queries might start seeing performance issues.
2. Better end-user experience: Users of executive dashboards and data applications
powered by Snowflake would require a better experience.
3. Cost: Striking a balance between performance and cost might be one of your goals unless you have an unlimited budget! Tuning long-running, frequently executed queries helps reduce cost.
4. Service Level Agreement (SLA): Specific use cases might require meeting an SLA; otherwise, the business can be negatively impacted. For example, a specific query should return results in less than 10 seconds.
5. Critical path: Queries rely on the results of other queries. For example, queries
transforming data or reading data would depend on queries loading data.
You might then need some help with understanding Snowflake’s performance challenges and
related solutions. That’s why I am writing this 2-part blog series!
In this first part, I focus on the performance challenges of Snowflake based on a
multidimensional approach that addresses Snowflake technology, Snowflake ecosystem, and
Snowflake users. As users, let’s not blame the technology if we are misusing or abusing
Snowflake! As vendors, let’s not blame the technology if our Snowflake integration is not
efficient or optimal!
In the second part, I will propose solutions for Snowflake performance tuning and
optimization with a unique step-by-step approach based on needs, symptoms, diagnoses, and
remediations.
II. Snowflake Technology (9)
1. Misperceptions and confusion from some Snowflake marketing messages
As a SaaS, Snowflake does not require the management of physical hardware, the installation
of software, and related maintenance. The Snowflake data platform is continually updated in
the background without the need for user involvement. Under the hood, Snowflake takes care
of many performance-related aspects that are usually the responsibilities of the customers in
other data warehouses and data platforms. Examples include horizontal or vertical data
partitioning to specify, data shards for even distribution across nodes, vacuuming, data
statistics collection and maintenance, and distribution Keys.
Many misperceptions and confusions about Snowflake performance tuning and optimization
are due to claims such as:
‘With the arrival of the cloud-built data warehouse, performance optimization
becomes a challenge of the past’. This is claimed by Snowflake Inc. in this white
paper titled ‘How Snowflake Automates Performance in a Modern Cloud Data
Warehouse’ and published on October 16, 2019.
‘Using Snowflake, everyone benefits from performance automation with very
little manual effort or maintenance’. This is claimed by Snowflake inc. in
this white paper titled ‘How Snowflake Automates Performance in a Modern
Cloud Data Warehouse’ and published on October 16, 2019.
'Insights into Snowflake's Near-Zero Management', a recorded presentation published on January 23, 2020. Here is an example of a Snowflake customer's reaction to such a statement of 'Zero Management or Near-Zero Management', as reported by a Snowflake employee in his blog: "I was recently working for a major UK customer, where the system manager said 'Snowflake says it needs Zero Management, but surely that's just a marketing ploy'."
Automatic Query Optimization. No Tuning!, a blog by Snowflake Inc. published
on May 19, 2016. Nevertheless, Snowflake offers a USD 3,000 Snowflake
Performance Automation and Tuning 3-Day Training!
Snowflake Data Management — No Admin Required, a recorded presentation by Snowflake Inc. published on January 13, 2020. Nevertheless, Snowflake Inc. offers a role-based USD 3,000 'Administering Snowflake Training' and a USD 375 'SnowPro Advanced Administrator Certification'!
Pacific Life- Busting Bottlenecks for Data Scientists With 1,800x Faster Query
Performance, A case study from Snowflake Inc.
Scale a near-infinite amount of computing resources, up or down, with just a few
clicks. Actually, you might run a query and get the error: “Maximum number of
servers for the account exceeded”. See the article from Snowflake: Query Failed
with Error: Max number of servers for the account exceeded
Such statements from Snowflake Inc., when not backed by facts, come across as marketing fluff and do a disservice to the Snowflake Data Cloud technology: customers end up dissatisfied with them, and competitors exploit them.
2. Inherent Snowflake performance limitations
2.1. Performance was not originally a major focus of Snowflake: This is a quote from
Snowflake founders in their paper The Snowflake Elastic Data Warehouse by Snowflake
Computing: ‘… Snowflake has only one tuning parameter: how much performance the user
wants (and is willing to pay for). While Snowflake’s performance is already very competitive,
especially considering the no-tuning aspect, we know of many optimizations that we have not
had the time for yet. Somewhat unexpected though, core performance turned out to be almost
never an issue for our users. The reason is that elastic compute via virtual warehouses can
offer the performance boost occasionally needed. That made us focus our development efforts
on other aspects of the system.’
Although Snowflake kept adding new services to improve performance, such as Materialized Views, Auto Clustering, the Query Acceleration Service, and Search Optimization, plus a few transparent enhancements such as a better data compression rate for newly loaded data and the ability to eliminate joins on key columns, most of these either came at additional cost or did not solve many of the inherent performance limitations evidenced in the list below.
2.2. Out-of-the-box limited concurrency: 8 concurrent queries per warehouse by default.
Autoscaling up to 10 warehouses. On a single-cluster virtual warehouse, you might hit a limit
of eight concurrent queries. 8 is the default value of the Snowflake
parameter MAX_CONCURRENCY_LEVEL that defines the maximum number of parallel or
concurrent statements a warehouse can execute. See also the article from Snowflake
Knowledge Base Warehouse Concurrency and Statement Timeout Parameters, published on
August 16, 2020. Such a low query concurrency limit forces either increasing the size of a single-cluster virtual warehouse or starting additional clusters in the case of a multi-cluster virtual warehouse (in Auto-scale mode). In both cases, this forces you to burn even more credits compared to the default concurrency you get out of the box from Snowflake!
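A sketch of how to inspect and adjust this parameter on a warehouse (the warehouse name is illustrative):
SHOW PARAMETERS LIKE 'MAX_CONCURRENCY_LEVEL' IN WAREHOUSE my_wh;
ALTER WAREHOUSE my_wh SET MAX_CONCURRENCY_LEVEL = 12;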
This is an answer from the Snowflake product team that I am posting as is: "We don't have a maximum of 8 concurrent queries: Check out this documentation for more
details: Parameters — Snowflake Documentation Note that this parameter does not limit the
number of statements that can be executed concurrently by a warehouse cluster. Instead, it
serves as an upper-boundary to protect against over-allocation of resources. As each
statement is submitted to a warehouse, Snowflake allocates resources for executing the
statement; if there aren’t enough resources available, the statement is queued or additional
clusters are started, depending on the warehouse.
The actual number of statements executed concurrently by a warehouse might be more or less
than the specified level:
Smaller, more basic statements: More statements might execute concurrently
because small statements generally execute on a subset of the available compute
resources in a warehouse. This means they only count as a fraction towards the
concurrency level.
Larger, more complex statements: Fewer statements might execute
concurrently.”
2.3. Heavy reliance on caching: Heavy reliance of Snowflake on caching can result in
unpredictable and non-optimal performance. At its core, Snowflake is built on a caching
architecture which works well on small scale data sets or repetitive traditional queries. This
architecture starts to fall down as data volumes expand or the workload complexity increases.
In the emerging space of advanced analytics, where machine learning, artificial intelligence,
graph theory, geospatial analytics, time-series analysis, adhoc analysis, and real-time
analytics are becoming predominant in every enterprise — data sets are typically larger and
workloads are becoming much more complex.
This is recognized by Snowflake Inc! “Since end-to-end query performance depends on both
cache hit rate for persistent data files and I/O throughput for intermediate data, it is
important to optimize how the ephemeral storage system splits capacity between the two.
Although we currently use the simple policy of always prioritizing intermediate data, it may
not be the optimal policy with respect to end-to-end performance objectives.” Excerpt from
this presentation and related paper titled ‘Building An Elastic Query Engine on Disaggregated
Storage’ and published in February 2020. PDF (15 pages), Video (19' 57")
2.4. Data clustering limitations: Time-based data is loaded by Snowflake in natural ingestion
order and helps gain performance benefits, by eliminating unnecessary reads through the
combination of automatically creating micro-partitions to hold data in them and
automatically capturing statistics, without any further action required from the user. Such
behavior is not guaranteed to happen if you are loading your data in a random sequence or
using multiple parallel load processes.
Oftentimes, the natural ingestion order is not the optimal physical ordering of data for
customer workload patterns. The user can define a set of key(s) to create a clustered table
where the underlying data is physically stored in the order of a user-defined set of key(s).
Selecting proper clustering keys is critical and requires an in-depth understanding of the
common workloads and access patterns against the table in question. Once the user selects the proper keys, they can benefit from performance gains through the Snowflake automatic clustering service.
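A sketch of defining a clustering key (table and columns are illustrative):
ALTER TABLE sales CLUSTER BY (sale_date, region);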
Snowflake automatic clustering comes with some limitations. Examples include cost-
ineffectiveness for tables that change frequently and clustering jobs that are not always smart.
Snowflake is still improving and optimizing the automatic clustering service but did not
publish the related roadmap.
2.5. Vertical scaling is a manual process: Vertical scaling or scaling up and down by resizing a
virtual warehouse is a manual process. It is also a common misconception to think that the
only solution available to improve query performance is to scale up to a bigger warehouse!
2.6. Initial poor performance of ad-hoc queries: When there is a need to answer questions not
already solved with predetermined or predefined queries and datasets, users create Ad-hoc
queries. For example, analysts might write ad-hoc queries for immediate data exploration
needs that tend to be heavy. Most of the time, analysts might not know what virtual
warehouse size to use, how to tune these queries, or whether it does make sense to tune them.
Related best practices would be to:
Isolate such ad-hoc queries by using a separate virtual warehouse to prevent
them from affecting the performance of other workloads.
Run such ad-hoc queries on bigger warehouses! They will end up running faster
and the cost would be the same compared to running slower in smaller
warehouses.
Use Snowflake Query Acceleration Service (QAS), a feature built into all
Snowflake Virtual Warehouses, to improve Ad-hoc query performance by
offloading large table scans to the QAS service.
2.7. No exposure of functionality like query optimizer hints: Snowflake does not expose
functionality like query optimizer hints that are found in other databases and data
warehouses to control the order in which joins are performed for example.
In some situations, it is not possible for Snowflake optimizer to identify the join ordering that
would result in the fastest execution. You might need to rewrite your query using an approach
that guarantees that the joins are executed in your preferred order. See this Snowflake
Knowledge Base article that is published on May 28, 2019 and titled How To: Control Join
Order
This is an answer from the Snowflake product team that I am posting as is: “This is not a
limitation but a design of Snowflake, to avoid common problems associated with join order
hints such as the query (including joining many tables) are going to have their join order
forced and this render the query very brittle and fragile. if the underlying data changes in the
future, you could be forcing multiple inefficient join orders. Your query that you tuned with
join order could go from running in seconds to minutes or hours.”
Update: Join elimination is a new feature in Snowflake that takes advantage of foreign key
constraints. In a way, this is the first optimizer hint in Snowflake! When two tables are joined
on a column, you have the option to annotate those columns with primary & foreign key
constraints using the RELY property. Setting this property tells the optimizer to check the
relationship between the tables during query planning. If the join isn’t needed, it will be
removed entirely.
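A sketch of annotating keys with RELY (table and column names are illustrative):
ALTER TABLE customers ADD CONSTRAINT pk_customers PRIMARY KEY (customer_id) RELY;
ALTER TABLE orders ADD CONSTRAINT fk_orders_customers
  FOREIGN KEY (customer_id) REFERENCES customers (customer_id) RELY;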
2.8. Rigid virtual warehouse sizing leads to computing waste and overspending: Snowflake
virtual warehouses come in fixed sizes that must be manually scaled to the next instance
doubling size and cost to match query complexity. For example, a Large virtual warehouse (8 nodes) would need to go to an X-Large virtual warehouse (16 nodes), even if meeting current
query complexity demands would require only one more node. Relying on fixed warehouse
sizing that doubles warehouse sizes and costs every time a little more query performance is
needed leads to computing waste and overspending.
2.9. Missing out on performance optimizations from Open File Formats: Snowflake users
might miss out on performance optimizations that common Open file formats such as Apache
Arrow, Apache Parquet, Apache Avro, and Apache ORC offer. Snowflake can import data
from these Open formats to its proprietary file format (FDN: Flocon De Neige) but can not
directly work with these open file formats.
With the announcement of the new Snowflake feature called Iceberg table format, Snowflake
is adding support for the Apache Parquet file format. Iceberg Tables are in a private preview
as of the publishing date of this article and are not publicly available to Snowflake customers
yet. Snowflake announced on January 21, 2022, expanded support for Iceberg via External
Tables. At the Snowflake Summit on June 14, 2022, Snowflake announced a new type of
Snowflake table called Iceberg Tables: “In this 6 minutes demo, Snowflake Software Engineer
Polita Paulus shows you how a new type of Snowflake table, called an Iceberg Table, extends
the features of Snowflake’s platform to Open formats, Apache Iceberg and Apache Parquet, in
storage managed by customers. You can work with Iceberg Tables as you would with any
Snowflake table, including being able to apply native column-level security, without losing the
interoperability that an open table format provides.”
You might wonder what the difference is between Snowflake's support for Apache Iceberg via External Tables versus Iceberg Tables: External Tables are read-only, while Iceberg Tables allow read, insert, and update.
2.10. No uniqueness enforcement: There is no way to enforce uniqueness in inserted data. If
you have a distributed system and it writes data on Snowflake, you will have to handle the
uniqueness yourself either on the application layer or by using some method of data de-
duplication.
Snowflake announced Unistore, not yet in public preview. This means Snowflake has a new
Hybrid Table Type that allows Unique, Primary, and Foreign Key constraints. Here’s a 6-
minute youtube demo.
2.11. No full separation of storage and compute
“Since end-to-end query performance depends on both cache hit rate for persistent data files
and I/O throughput for intermediate data, it is important to optimize how the ephemeral
storage system splits capacity between the two. Although we currently use the simple policy of
always prioritizing intermediate data, it may not be the optimal policy with respect to end-to-
end performance objectives.” Excerpt from this presentation and related paper titled ‘Building
An Elastic Query Engine on Disaggregated Storage’ and published in February 2020. PDF (15
pages), Video (19' 57")
2.12. Heavy scans: Before copying data, Snowflake checks that files have not already been
loaded. This will affect load performance. To avoid scanning terabytes of files that have
already been loaded, you can simply partition your staged data files! This will help maximize
your load performance.
2.13. Out-of-the-box concurrent write Limits: Snowflake has a built-in limit of 20 DML
statements that target the same table concurrently, including COPY, INSERT, MERGE,
UPDATE, and DELETE. Snowflake users might not be aware of such a concurrent write limit
as this is not in Snowflake documentation. The root cause seems to be related to constraints
in the Global Service Layer’s metadata repository database. You need to be aware that
Snowflake is designed for high-volume high-concurrency reading and not writing.
To overcome such a performance challenge of concurrent write limits in Snowflake, you need
to architect around it as explained in this article published on July 26, 2019.
2.14. Data Latency: Latency issues in Snowflake can be due to many reasons. Examples
include Snowflake and other data services not in the same cloud region. To avoid such latency
issues, you’d need to have Snowflake on the same public cloud and the same or closer cloud
region where your data, users, and other services reside.
2.15. Inefficient natural partitioning as the data grows: By default, Snowflake auto clusters
your rows according to their insertion order. Because insertion order is often correlated with
dates, and dates are a popular filtering condition, this default clustering works quite well in
many cases. However, when you have too many DML statements after your initial data load,
your clustering may lose its effectiveness over time: more overlap means fewer pruning
opportunities. Using SYSTEM$CLUSTERING_DEPTH system function helps with seeing
how effective your micro-partitions are in terms of a specified set of columns. It returns the
average depth as an indicator of how much your micro-partitions overlap. The lower this
number, the less overlap, the more pruning opportunities, and the better your query
performance when filtering on those columns.
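A sketch of such a check (table and column are illustrative):
SELECT SYSTEM$CLUSTERING_DEPTH('events', '(event_date)');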
2.16. Shared Cloud Services layer performance degradation: Snowflake shared Cloud Service
Layer in a particular region can become overloaded.
This is a real-world interaction with Snowflake Customer Service with their analysis of why
our metadata queries either were extremely slow or timed out: “Our team has reviewed the
information provided. We found on September 1, 2022 there was an issue with our Cloud
Services servers encountering heavier than normal usage. As a result, some queries had
timeouts reading metadata resulting in incidents. This issue with the servers was identified by
the Snowflake team during the event and steps were taken to resolve the issue. We apologize
for the inconvenience. Thanks”. A potential solution is for Snowflake to predictively scale
Cloud Services clusters prior to hitting any resource limits.
In addition, an outage of the shared Cloud Services will affect multiple customers. Getting
technical support will become a big challenge.
2.17. No Snowflake query workload manager: Unlike other data warehouses, Snowflake lacks
a resource manager to assign resources to queries based on their importance and overall
system workload.
2.18. Significant delay in getting large output Resultset in Snowflake UI: You might
experience query slowness due to fetching a large output resultset.
You can set this parameter UI_QUERY_RESULT_FORMAT to ARROW at the
session/account/user level and test if you are getting the results faster:
alter session set UI_QUERY_RESULT_FORMAT = 'ARROW';
The post How To: Resolve Query slowness due to large output, published on March 28, 2022, provides the solution for avoiding query slowness caused by fetching a large output resultset.
3. Forced spending caused by built-in Snowflake service offerings
Using Snowflake's additional performance-related services increases costs and challenges users to strike a balance between performance and cost:
3.1. Automatic Clustering Service: Reorganize table data to align with query patterns.
Clustering in Snowflake relates to colocating rows with other similar rows in a micro
partition. Data that is well clustered can be queried faster and more affordably due to
partition pruning that allows Snowflake to skip data that does not pertain to the query based
on statistics of the micro partitions. When clustering is defined, the Automatic Clustering
service, in the background, rewrites micro-partitions to group rows with similar values for the
clustering columns in the same micro-partition.
Snowflake users might make some clustering choices that don’t have any performance gains
and are a waste of spending:
DO NOT: Cluster by a timestamp when most predicates are by DAY.
DO NOT: Cluster on a VARCHAR field that has low cardinality prefix
DO NOT: Try to compensate for starting character cardinality with a HASH function
DO NOT: Change clustering in production on a table w/o a manual rewrite
The quick write-up Tips for Clustering in Snowflake will help you understand what clustering is in Snowflake, why it is important, how to pick cluster keys, and what not to do.
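As a hedged illustration of the first point above (the table and column are hypothetical), clustering on the day rather than the raw timestamp matches day-level predicates:
-- Cluster on the date derived from the timestamp, since most queries filter by day.
alter table my_db.my_schema.events cluster by (to_date(event_ts));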
3.2. Materialized Views: To boost query performance for workloads that repeat many of the same queries, you might create materialized views to store frequently used projections and aggregations. Although results from queries against materialized views are guaranteed to be up to date, those views come with extra costs. Before creating any materialized view, consider whether its expense will be offset by the savings from reusing its results frequently enough.
3.3. Search Optimization Service: To quickly find the needle in the haystack and return a small number of rows from a large table, you might enable search optimization on that table. The maintenance service then begins constructing the table's search access paths in the background and might massively parallelize the related jobs, which can result in a rapid increase in spending. Before enabling search optimization on a large table, you can estimate the spending using SYSTEM$ESTIMATE_SEARCH_OPTIMIZATION_COSTS so you know what you're in for.
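A minimal sketch (the table name is an assumption):
-- Returns estimated build and maintenance costs before you commit to the service.
select system$estimate_search_optimization_costs('my_db.my_schema.big_table');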
3.4. Query Acceleration Service: Elastic scale out of computing resources without changing
the size of virtual warehouses. See What is Snowflake Query Acceleration Service?
4. Not enough performance tuning tools and services
Despite already celebrating its 10th anniversary, Snowflake is considered relatively new and
evolving fast compared to legacy data warehouses and platforms that have matured over
decades and provided fine-grained tuning tools that DBAs typically use.
Although Snowflake's features for investigating slow performance, such as the query profile, the query history page, and warehouse loading charts, offer valuable data and insights, there is a need for more tuning tools and services. For example, additional tools that might be helpful include a Clustered Tables Explorer & Analyzer, a Warehouse Idle Time Alerter, …
5. Overall Performance degradation due to data, users, and query complexity
growth
Queries might start showing performance issues as the underlying data grows or the number of users querying the data increases. Although performance degradation is not specific to Snowflake, users find it difficult to fix in Snowflake given the limited tuning options. They lack features, such as indexes and workload managers, that they are used to from other data platforms, as well as tools to investigate and fix Snowflake performance issues.
6. Performance limitations due to cloud generic hardware infrastructure
Snowflake runs on generic and general-purpose hardware infrastructure available in AWS,
GCP, or Azure. Such cloud hardware limits Snowflake’s performance as it is not optimized for
data warehousing workloads and other workloads that Snowflake supports.
In addition, unlike some competitors, Snowflake does not offer performance tuning through hardware selection. Snowflake does not disclose its hardware specs, so performance cannot be customized through hardware beyond simply selecting a virtual warehouse size, without knowing the underlying details or having any other hardware-related choices.
At its 2022 Summit, Snowflake announced ‘Transparent engine updates’: “For those of you
running on AWS, you will get faster performance for all of your workloads. We’ve optimized
Snowflake to take advantage of new hardware improvements offered by AWS, and we are
seeing 10% faster compute on average in the regions already rolled out. No user intervention
or choosing a particular configuration is required for this latest performance enhancement.”
See blog post published on July 27, 2022 and titled Snowflake’s New Engine and Platform
Announcements
At its 2022 Summit, Snowflake also announced large-memory instances for ML workloads to enable training models inside Snowflake. See the demo titled 'Train And Deploy Machine Learning Models With Snowpark For Python', published on June 14, 2022.
7. Dependency on cloud providers and risk of services downtime
Snowflake depends on many services provided by cloud providers such as AWS, GCP, and
Azure. When such services suffer a performance degradation or go down, this directly affects
your Snowflake account.
Downtime is a disadvantage of cloud computing, and cloud services, including Snowflake, are not immune to outages or slowdowns; they can occur for any reason. Your business needs to assess the impact of an outage, slowdown, or planned downtime from Snowflake, follow best practices for minimizing the impact of slowdowns and outages, and implement solutions such as Business Continuity and Disaster Recovery plans. You can subscribe to Snowflake status updates here.
8. Limitations in the Query Profile
Snowflake offers Query Profile to analyze queries. In Snowflake SnowSight UI, in the Query
Profile view, there is a section called Profile Overview where you can see the breakdown of the
total execution time. It contains statistics like Processing, Local Disk I/O, Remote Disk I/O,
Synchronization etc. At this time, there is no way in Snowflake to access those statistics
programmatically instead of having to navigate to that section for each query that you want to
analyze. This is a frequent request from customers and it might get offered one day!
Meanwhile, one workaround is to use the open-source project 'Snowflake Snowsight Extensions', which enables manipulation of Snowsight features from the command line. You can also wait for GET_QUERY_STATS, a system function that returns statistics about the execution of one or more queries (the same statistics available in the Query Profile tab in Snowsight) via a programmatic interface; it was still in private preview as of the date of publication of this article.
Update: GET_QUERY_OPERATOR_STATS(), in public preview available to all accounts, is a
system function that returns statistics about individual query operators within a query. You
can run this function for any query that was executed in the past 14 days. Such statistics,
available in the query profile tab in Snowsight, are now available via a programmatic
interface. See blog from Snowflake Inc. published on December 20, 2022 and titled Analyze
Your Query Performance Like Never Before with Programmatic Access to Query Profile.
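For example, the per-operator statistics of the most recent query in the session can be retrieved as follows (any query ID from the last 14 days also works):
-- Per-operator statistics: operator type, rows produced, execution time percentage, etc.
select * from table(get_query_operator_stats(last_query_id()));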
Update: ‘Programmatic Access to Query Profile Statistics, in public preview soon, which will
let customers analyze long-running and time-consuming queries more easily. This view will
also identify (and let users resolve) performance problems such as exploding joins before they
impact the end user.’ See blog from Snowflake published on November 8th, 2022: A Faster,
More Efficient Snowflake Takes the Stage at Snowday
9. Lack of documentation of error codes and messages
Snowflake might display cryptic error messages that are not covered in the Snowflake documentation. You then need to contact Snowflake technical support or dig into the source code of connectors and drivers in GitHub repositories, and even that does not cover all parts of the Snowflake data platform. Although Snowflake has been made aware of this issue, it has no plans to add related documentation at this time!
III. Snowflake Ecosystem (3)
1. Inefficient integration with Snowflake
Code generated by third-party tools to connect and integrate with Snowflake can be inefficient from a performance perspective. For example, metadata-related queries generated as introspection code can be extremely slow.
2. Best practices knowledge required by third-party tools and services
Almost every third-party tool and service from the Snowflake ecosystem comes with an
exhaustive list of best practices to follow! The burden is on Snowflake users to be aware of
such best practices and implement them when integrating with Snowflake.
3. Piecemeal information about Snowflake performance optimization and
tuning
Some information about Snowflake’s performance, available piecemeal here and there, lacks
structure, needs consolidation, and might be outdated or misleading.
IV. Snowflake Users (4)
1. Snowflake users need to get familiar with Snowflake’s ways of dealing
with performance
Users familiar with their legacy systems and migrating to Snowflake using a lift-and-shift approach might encounter performance issues in their Snowflake account. Some of the old ways they are used to from their legacy systems, such as indexes, won't work in Snowflake! They would need to tune their lift-and-shift approach and take a fresh look at their performance problems.
2. Snowflake users take shortcuts for solving performance issues
Snowflake makes it incredibly easy for users to try and fix slow workloads by simply throwing
more compute resources at them. Snowflake charges for services such as automated
clustering that help optimize performance when used properly.
Instead of identifying the root causes of performance problems, following known Snowflake performance best practices, and avoiding anti-patterns, Snowflake users take shortcuts. To solve some of their performance problems, they opt for brute force and the ease of use of features such as scaling up, scaling out, and automatic clustering, at the cost of unnecessary compute spending!
3. Snowflake users lack the knowledge about performance optimization and
tuning
Snowflake users are a broad mix of business and technical users with varying levels of
Snowflake proficiency. As a result, they might have inefficient queries and processes of data
ingestion, transformation, and consumption.
4. Snowflake users face a steep learning curve of Snowflake performance
optimization and tuning
Like any new change or experience, it can be jarring to transition to Snowflake and learn more
about its performance optimization and tuning. In addition, as Snowflake keeps quickly
evolving and adding new features and functionality, you need to keep yourself up-to-date to
get the most out of your Snowflake account. Snowflake does not remove the need for highly
skilled technical resources to operate and tune a Snowflake account.
Here are some examples of Snowflake performance aspects you need to keep learning about
and related challenges you need to solve:
4.1. Long-running queries: You might need to seek and destroy long-running queries that are the result of mistakes, eat up compute resources, and add no value! The blog posted on September 5, 2022 and titled Snowflake: Long Running Queries Seek & Destroy shows an example of a long-running query and two ways to solve this problem.
Using QUERY_HISTORY, you can accurately identify long-running queries by looking at their EXECUTION_TIME. Note that TOTAL_ELAPSED_TIME also includes the time the query spent sitting in the queue while the warehouse was being provisioned or was overloaded. See the blog published on January 10, 2022 and titled Long Running Queries in Snowflake: QUERY_HISTORY function.
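A hedged sketch using the INFORMATION_SCHEMA.QUERY_HISTORY table function (the 10-minute threshold and 24-hour window below are arbitrary choices, not from the article):
-- Queries from the last 24 hours whose execution time (excluding queue time)
-- exceeded 10 minutes, longest first.
select query_id, query_text, execution_time, total_elapsed_time
from table(information_schema.query_history(
       end_time_range_start => dateadd('hour', -24, current_timestamp())))
where execution_time > 10 * 60 * 1000  -- milliseconds
order by execution_time desc;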
4.2. Complex queries: If a query is spending more time compiling compared to executing,
perhaps it is time to review the complexity of the query. See article Understanding Why
Compilation Time in Snowflake Can Be Higher than Execution Time
4.3. Inefficient queries: Although tuning and optimizing SQL queries is a big topic, some
known guardrails could help you pick some low-hanging fruits of performance improvement.
Examples include
Select *: When fetching required attributes, avoid SELECT *, as it moves every column from storage into the warehouse cache, slowing down the query and filling the cache with unwanted data. Snowflake is a columnar data store: explicitly write only the columns you need, or specify the columns to drop from the results using SELECT * EXCLUDE (col_name, col_name, …). See Selecting All Columns Except Two or More Columns, and the sketch after this list.
Exploding joins: This happens when joining tables without providing a join
condition (resulting in a “Cartesian product”), or providing a condition where
records from one table match multiple records from another table. For such
queries, the Join operator produces significantly (often by orders of magnitude)
more tuples than it consumes. This is a common mistake to avoid when writing
code to join tables and to identify using the Query Profile. See related Snowflake
documentation “Exploding” Joins and the Snowflake Knowledge Base article
published on January 15, 2019 and titled How To: Recognize Row Explosion
UNION without ALL: In SQL, it is possible to combine two sets of data with either UNION or UNION ALL. The difference is that UNION ALL simply concatenates its inputs, while UNION does the same but also performs duplicate elimination. A common mistake is to use UNION when the UNION ALL semantics are sufficient; such queries show up in the Query Profile as a UnionAll operator with an extra Aggregate operator on top (which performs the duplicate elimination). The best practice is to use UNION ALL instead of UNION when deduplication is not required (or when the rows being combined are already known to be distinct). See the related Snowflake documentation UNION Without ALL, and the sketch after this list.
Not using ANSI join: “There can be potential performance differences between
non-ANSI and ANSI join syntax as the parser does not currently
transform/rewrite the join using the ANSI syntax after parsing. As a result, Non-
ANSI syntax doesn’t benefit from the subsequent steps of transformation and
optimization by the compiler that eventually generate a good plan.” See
Snowflake Knowledge Base article, published on June 18, 2022, and
titled Performance Implications of using Non-ANSI syntax versus ANSI join
syntax.
Using String instead of Date or Timestamp data type: Date/Time Data Types for Columns: “When defining columns to contain dates or timestamps, Snowflake recommends choosing a date or timestamp data type rather than a character data type. Snowflake stores DATE and TIMESTAMP data more efficiently than VARCHAR, resulting in better query performance. Choose an appropriate date or timestamp data type, depending on the level of granularity required.” See Snowflake Table Design considerations.
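To illustrate the first and third points above (table and column names are made up for the example), explicit column lists, EXCLUDE, and UNION ALL look like this:
-- List only the columns you need, or exclude the wide ones you don't.
select order_id, customer_id, amount from orders;
select * exclude (raw_payload, notes) from orders;
-- Prefer UNION ALL when duplicates are acceptable or the inputs are already distinct;
-- UNION adds a deduplication step (the extra Aggregate operator in the Query Profile).
select order_id from orders_2022
union all
select order_id from orders_2023;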
People might learn SQL on the job and make their fair share of mistakes, including queries similar to the ones above. They can learn to write SQL more efficiently by taking an online SQL class or a training that focuses on SQL for Snowflake.
4.4. Storage Spillage: For some operations (e.g. duplicate elimination for a huge data set), the amount of memory available to the compute resources executing the operation might not be sufficient to hold intermediate results. In that case, the query processing engine starts spilling data to the local disk, and if local disk space is not sufficient, the spilled data is saved to remote disks. Disk drives are a lot slower than RAM, so spilling can have a profound effect on query performance, especially if a remote disk is used. A larger warehouse also provides more RAM, so the query or load can complete so much faster that the savings exceed the extra cost of the larger size. See the Snowflake Knowledge Base article published on February 5, 2019 and titled How To: Recognize Disk Spilling.
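A hedged way to spot spilling queries using the ACCOUNT_USAGE.QUERY_HISTORY view (which can lag by up to 45 minutes; the 7-day window and limit are arbitrary):
-- Recent queries that spilled to remote storage, worst offenders first.
select query_id, warehouse_name, warehouse_size,
       bytes_spilled_to_local_storage, bytes_spilled_to_remote_storage
from snowflake.account_usage.query_history
where start_time > dateadd('day', -7, current_timestamp())
  and bytes_spilled_to_remote_storage > 0
order by bytes_spilled_to_remote_storage desc
limit 20;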
4.5. Concurrent queries: If a user experiences a performance issue with a query, find out which queries were running concurrently with it in the same warehouse. Concurrent queries on the same virtual warehouse share CPU, memory, and other key resources, which can impact their response times.
See the article from Snowflake Knowledge Base How To: Query to find concurrent queries
running on the same warehouse, published on July 27, 2020.
4.6. Outlier queries: Examples of outlier queries include those that use more resources than other queries in the same Snowflake virtual warehouse. See the Snowflake Query Acceleration Service, in public preview as of the writing of this article, which helps address this. You can add the QUERY_ACCELERATION service via Snowsight when creating a compute cluster to make complex queries run faster by injecting additional CPU horsepower from Snowflake's serverless pools, while the remaining simpler queries execute using only the regular nodes in your cluster. There is no need to increase the size of the entire cluster and waste money just to speed up a few of the more complex queries when most other queries do not need that extra CPU power.
4.7. Queued queries: See the Snowflake Knowledge Base article published on September 9, 2020, How To: Understand Queuing, which explains the different types of queuing and what to do when it occurs.
4.8. Blocked queries: A blocked query is attempting to acquire a lock on a table or partition that is already locked by another transaction. Account administrators (ACCOUNTADMIN role) can view all locks, transactions, and sessions with SHOW LOCKS [IN ACCOUNT]. For all other roles, the function only shows locks across all sessions for the current user.
References:
Snowflake Knowledge Base article titled How To: Resolve blocked queries and
published on May 18, 2017.
The blog 'Transaction Locks in Snowflake' by Sachin Mittal, published on March 15, 2022.
4.9. Queries against the metadata layer: A query might be spending more time on metadata operations. The Information Schema views are optimized for queries that retrieve a small subset of objects from the dictionary, so whenever possible, maximize the performance of your queries by filtering on schema and object names. For more usage information and details, see the related Snowflake Information Schema documentation.
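For example (the database, schema, and name pattern below are assumptions), filtering Information Schema views on schema and object names keeps the metadata layer from touching every object:
-- Narrow the scope instead of scanning metadata for every schema in the database.
select table_name, row_count, bytes
from my_db.information_schema.tables
where table_schema = 'SALES'
  and table_name like 'ORDERS%';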
4.10. Inefficient Pruning: ‘Snowflake collects rich statistics on data allowing it not to read
unnecessary parts of a table based on the query filters. However, for this to have an effect, the
data storage order needs to be correlated with the query filter attributes. The efficiency of
pruning can be observed by comparing Partitions scanned and Partitions total statistics in the
TableScan operators. If the former is a small fraction of the latter, pruning is efficient. If not,
the pruning did not have an effect. Of course, pruning can only help queries that filter out a
significant amount of data. If the pruning statistics do not show data reduction, but there is a
Filter operator above TableScan which filters out a number of records, this might signal that a
different data organization might be beneficial for this query.’ See related Snowflake
documentation on Inefficient pruning and Understanding Snowflake Table Structures. See
also Snowflake Knowledge Base article published on January 24, 2019 and titled How To:
Recognize Unsatisfactory Pruning
4.11. Inefficient data model: An inefficient data model can negatively impact your Snowflake
performance. Often overlooked, optimizing your data model will improve your Snowflake
performance. Snowflake supports various modeling techniques such as Star, Snowflake, Data
Vault and BEAM. Let your usage patterns drive your data model design. Think about how you
foresee your data consumers and business applications leveraging data assets in Snowflake.
4.12 Connected clients issues: You might be connecting to Snowflake using clients that are out
of Snowflake support and not benefiting from fixes, performance and security enhancements,
and new features. You will need to periodically check the versions of your connectors to
Snowflake and upgrade. Snowflake sends a quarterly email on behalf of Support Notifications
titled ‘Quarterly Announcement — End of Support Snowflake Client Drivers: Please Upgrade!’
4.13. Lack of sub-query pruning: Pruning might not happen when a query filters using a sub-query. See this Snowflake Knowledge Base article.
4.14. Queries not leveraging results cache: When a query is executed, the result is persisted
(i.e. cached) for a period of time. At the end of the time period, the result is purged from the
system. For persisted query results of all sizes, the cache expires after 24 hours. If a user
repeats a query that has already been run, and the data in the table(s) hasn’t changed since
the last time that the query was run, then the result of the query is the same. Instead of
running the query again, Snowflake simply returns the same result that it returned previously.
This can substantially reduce query time because Snowflake bypasses query execution and,
instead, retrieves the result directly from the cache. To learn more, check Using Persisted Query Results.
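Result reuse is controlled by the USE_CACHED_RESULT parameter; a minimal sketch for checking and enabling it at the session level:
-- Check the current setting, then make sure result reuse is enabled for this session.
show parameters like 'USE_CACHED_RESULT';
alter session set use_cached_result = true;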
4.15. Queries doing a point lookup: You might need to improve the performance of selective point-lookup queries that return only one or a small number of distinct rows from a large table. The Search Optimization Service (SOS) is a feature that optimizes all supported columns within a table. It is a table-level property; once added to a table, a maintenance service creates and populates search access paths that are used to perform lookups.
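A minimal sketch (table and column names assumed):
-- Enable search optimization for the whole table, or only for equality lookups on one column.
alter table my_db.my_schema.big_table add search optimization;
alter table my_db.my_schema.big_table add search optimization on equality(customer_id);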
4.16. Queries against very large tables: You might have a very large table where most of your queries select the same few columns, apply roughly the same filters, and aggregate on the same column. Creating a Materialized View (MV) for these repeating patterns can greatly help such queries: the materialized view already holds the results of the repeating patterns, ready for retrieval, rather than computing them over and over against the table.
Over time, as the data in a very large table changes through DML transactions, its distribution becomes more and more disorganized. Automatic Clustering lets users designate a column or set of columns as the Clustering Key; Snowflake uses this key to reorganize the data so that related records land in the same micro-partitions, enabling more efficient partition pruning. Once a Clustering Key is defined, the table is reclustered automatically to maintain optimal data distribution. Automatic Clustering then helps queries against large tables that use range or equality filters on the Clustering Key.
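A hedged sketch of both ideas (all table, view, and column names are hypothetical):
-- Materialized view for a repeating aggregate over a very large table.
create materialized view my_db.my_schema.daily_sales_mv as
  select sale_date, region, sum(amount) as total_amount
  from my_db.my_schema.sales
  group by sale_date, region;
-- Clustering key on the column used by range and equality filters.
alter table my_db.my_schema.sales cluster by (sale_date);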
4.17. Latency issues
4.18. Heavy Scanning
4.19. Slow data load speed: Slow loads can have many causes, such as not partitioning staged data files (so files that have already been loaded are rescanned) or not breaking up single large files into appropriately sized chunks that benefit from Snowflake's automatic parallel execution.
4.20. Slow ingestion of streaming data
What else would you like to add to the above Snowflake performance challenges? I would much appreciate your comments and feedback.
Please, spread the word, and don’t forget to give this article a ‘Like’ if you find it helpful so
that your LinkedIn buddies can read it and comment on it too. I hope this 2-part blog series
will pave the way to further constructive discussions about Snowflake performance challenges
and solutions for the benefit of all.
Task scheduling and triggered tasks both perform the same function in
Snowflake, but task scheduling can be more resource-intensive and
costly.
Triggered tasks execute when a stream has data. Before triggered tasks, such tasks were typically set to run every minute with SCHEDULE = '1 minute' and WHEN SYSTEM$STREAM_HAS_DATA('my_stream_name'). Scheduling tasks this way often results in tasks running constantly, even when they perform no work.
Triggered tasks eliminate the need to schedule tasks to run constantly.
To create a triggered task, omit the SCHEDULE parameter from the task definition, as in the sketch below. Support for this feature is currently not in production and is available only to selected accounts.
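A hedged sketch of the pattern described above (the task, warehouse, stream, and target table names are assumptions):
-- No SCHEDULE: the task runs only when the stream has data.
create task my_triggered_task
  warehouse = my_wh
  when system$stream_has_data('my_stream_name')
as
  insert into my_target_table
  select * from my_stream_name;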
Triggered tasks only execute when changes are made to the table that the associated stream is tracking, and they run shortly after the data change is committed (usually within a few seconds).
Benefits of Triggered Tasks:
Task execution is made simple and automatic through data change triggers. Tasks are
triggered automatically when data is available in the stream, eliminating the need for
schedules. This approach reduces cloud service charges and load, freeing up resources for
other tasks that perform useful work instead of simply polling for work.
Triggered tasks can execute as often as every 15 seconds by modifying the USER_TASK_MINIMUM_TRIGGER_INTERVAL_IN_SECONDS parameter; by default, they execute at most every 30 seconds.
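For example (the task name is assumed, and setting this parameter at the task level is an assumption based on its description as a task parameter), the interval could be lowered on an individual task:
-- Allow this triggered task to fire as often as every 15 seconds.
alter task my_triggered_task set user_task_minimum_trigger_interval_in_seconds = 15;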
In addition to being triggered by data changes, a triggered task also runs:
• Once when the task is first resumed, to consume any data already in the stream (for example, a stream created with SHOW_INITIAL_ROWS = TRUE).
• Once every 12 hours, if not previously triggered during that time period. If there is no data in the stream, the run is skipped and no compute resources are used.
Differences between Triggered Tasks and Scheduled Tasks:
Triggered tasks differ from scheduled tasks in that scheduled tasks specify the SCHEDULE parameter. Triggered tasks generally behave the same way as scheduled tasks in most respects, with the following exceptions:
In SHOW TASKS / DESC TASK output, the SCHEDULE property displays NULL for
triggered tasks.
If the stream or table that the stream is tracking is dropped (including CREATE OR
REPLACE, which drops it and creates a new one of the same name), the triggered task will no
longer be able to trigger and will automatically suspend. Essentially the result is the same as if
you ran ALTER TASK <task_name> SUSPEND. After the table and/or stream are re-created,
the user can run ALTER TASK <task_name> RESUME to resume triggered processing.
WHEN conditions such as NOT SYSTEM$STREAM_HAS_DATA('my_stream') and SYSTEM$STREAM_HAS_DATA('my_stream') = FALSE are not allowed; triggered tasks only fire on data changing, not on data remaining the same. AND/OR conditions are allowed; for example, SYSTEM$STREAM_HAS_DATA('stream1') OR SYSTEM$STREAM_HAS_DATA('stream2') will trigger when either stream has data. Note that a run can be skipped if an AND condition is used and only one of the streams has data.