
Conclusion

Snowflake's performance capabilities are unparalleled, but to truly maximize performance and
minimize costs, Snowflake performance tuning and optimization are essential. While Snowflake
can handle massive data volumes and complex queries with ease, that does not mean it always
should.
In this article, we explored several Snowflake features and best practices to boost query
performance. To summarize:
 Monitor and reduce queueing to minimize wait times. Queueing is often the culprit
behind sluggish queries.
 Leverage result caching to reuse results and slash compute time.
 Tame row explosion with techniques like optimizing data types and data volumes,
optimizing subqueries, using the DISTINCT clause, avoiding Cartesian products, and
using temporary tables.
 Optimize query pruning to scan less data and improve query speed.
 Manage and monitor disk spillage to avoid performance impacts.
While this covers many essential Snowflake performance tuning techniques, it only scratches the
surface of Snowflake's capabilities. This concludes Part 3 of our article series on Snowflake
performance tuning. If you missed the previous installments, be sure to check out Part 1 and Part
2 for a comprehensive understanding of the topic.
FAQs
What are some common Snowflake Performance Tuning techniques?
Common Snowflake Performance Tuning techniques include query optimization, data clustering,
result caching, monitoring disk spillage and using the right Snowflake warehouse size for your
workload.
How does query optimization improve Snowflake performance?
Query optimization in Snowflake can significantly improve performance by reducing the amount
of data scanned during a query, thus reducing the time and resources required to execute the query.
What is the role of data clustering in Snowflake Performance Tuning?
Data clustering in Snowflake helps to minimize the amount of data scanned during a query,
which can significantly improve query performance.
How does warehouse size affect Snowflake performance?
The size of the Snowflake warehouse can greatly impact performance. Larger warehouses can
process queries faster, but they also consume more credits. Therefore, it's important to choose
the right size for your specific needs.
Can caching improve Snowflake performance?
Yes, caching can significantly improve Snowflake performance. Snowflake automatically caches
data and query results, which can speed up query execution times.
How do you check Snowflake performance?
There are several ways to check your Snowflake performance, including monitoring query
execution times, analyzing query plans, and reviewing resource usage. You can also use the
Snowflake web interface (Snowsight), Snowflake's built-in performance monitoring tools, such
as the Query Profile and Query History views, the ACCOUNT_USAGE schema, as well as
third-party tools like Chaos Genius, to gain insights into your system's performance and identify
areas for improvement.
What makes Snowflake fast?
Snowflake's architecture is designed to be highly scalable and performant. It uses a unique
separation of compute and storage, allowing for independent scaling of each component. Also,
Snowflake uses a columnar data format and advanced compression techniques to minimize data
movement and optimize query performance.
How do you handle long-running queries in Snowflake?
To handle long-running queries in Snowflake, you can use the 'QUERY_HISTORY' function to
identify them. Once identified, you can either manually cancel the query using the
'SYSTEM$CANCEL_QUERY' function, or optimize the query for better performance.
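For illustration, here is a minimal sketch of that workflow; the 5-minute threshold and the query ID placeholder are just examples:
-- Find queries that are still running after more than 5 minutes
SELECT query_id, user_name, query_text
FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY())
WHERE execution_status = 'RUNNING'
  AND start_time < DATEADD('minute', -5, CURRENT_TIMESTAMP());

-- Cancel a specific query (substitute a real query ID returned above)
SELECT SYSTEM$CANCEL_QUERY('<query_id>');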
What is Snowflake stage?
A Snowflake stage is a location where data can be stored within the Snowflake data warehouse.
It can be thought of as a folder or directory within the Snowflake environment where data files
in various formats (such as CSV, JSON, or Parquet) can be stored and accessed by Snowflake
users.
What is the difference between Snowflake stage and External Tables?
A Snowflake stage is a storage location for data files, whereas an external table is a virtual table
that points to data stored outside Snowflake. The key difference is that a stage loads data into
Snowflake, while an external table enables querying of data located external to Snowflake.
What is the difference between external table and regular table in Snowflake?
External tables reference external data sources. Regular tables store data natively within
Snowflake.
What is the difference between Snowpipe and Snowflake external table?
Snowpipe in Snowflake is an automated data ingestion service that continuously loads data from
external sources into Snowflake tables. On the other hand, an external table is a virtual table that
references data stored outside of Snowflake.
What are the supported Snowflake file formats?
Snowflake supports a variety of file formats, including CSV, JSON, Avro, Parquet, ORC, and
XML.
How do you use a file format in Snowflake?
Create it with CREATE FILE FORMAT, specifying the format type, compression, encoding,
etc. Use it when loading/unloading data.
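As a quick sketch, assuming a hypothetical format name, stage, and table (my_csv_format, my_stage, my_table), the pattern looks like this:
-- Define a reusable CSV file format
CREATE OR REPLACE FILE FORMAT my_csv_format
  TYPE = 'CSV'
  FIELD_DELIMITER = ','
  SKIP_HEADER = 1
  COMPRESSION = 'GZIP';

-- Reference the named file format when loading data from a stage
COPY INTO my_table
FROM @my_stage/data/
FILE_FORMAT = (FORMAT_NAME = 'my_csv_format');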
What is default Snowflake file format?
The default Snowflake file format is CSV. This means that if you do not specify a file format
when loading or unloading data, Snowflake will use the CSV format.
What is the best Snowflake file format?
It depends on use case. Parquet/ORC for structured data, JSON/Avro for semi-structured, CSV
for simple use cases. Consider size, performance, compression.
How do I get a list of file formats in Snowflake?
Use SHOW FILE FORMATS to display the names, types, and properties of existing file
formats.
Snowflake Query Tagging by User/Account/Session
You can set Snowflake query tags at three different levels:
 Account
 User
 Session
When Snowflake query tags are set at multiple levels, they are applied in a specific order of
precedence. So at the account level, a query tag is set for the entire Snowflake account. If a user-
level query tag is set for a specific user in the Snowflake account, it will override the account-
level query tag for any queries that the user runs.

In the same way, if a user-level or account-level query tag is already in place and a session-level
query tag is set, the session-level query tag will take priority over the user-level, or account-
level tag for any queries run during that session.

Hence, Snowflake query tags can be set at multiple levels, but the order of precedence
determines which tag is applied for any given query.
Here's an in-depth explanation of each level:
Account-Level Query Tag:
An account-level query tag is a tag that applies to all queries run in the account, no matter who
runs them. This means that if an account admin sets an account-level query tag, it will be
automatically applied to every query run by any user in that account.
Note: Only an ACCOUNTADMIN can set an account-level query tag.
To assign an account-level query tag in Snowflake, you can use the ACCOUNTADMIN role and
set the tag with the ALTER ACCOUNT command.
Here's an example SQL query:
USE ROLE ACCOUNTADMIN;
ALTER ACCOUNT SET QUERY_TAG = 'Account_level_query_tag';

Set account-level query tag


User-Level Query Tags:
A user-level query tag is a way to apply a tag to the queries run by a specific user. This tag will
only be applied to the queries run by that user and not to any other queries. To set a user-level
query tag, you must have the "ALTER USER" privilege. This feature can be handy in situations
where you need to track queries for specific users or departments.
To assign a user-level query tag in Snowflake, you can use the ALTER USER command with
the SET QUERY_TAG parameter.
Here's a sample SQL query:
USE ROLE SYSADMIN;

ALTER USER Pramit SET QUERY_TAG = 'TeamA';
ALTER USER Preeti SET QUERY_TAG = 'TeamB';
ALTER USER Happy SET QUERY_TAG = 'TeamC';
ALTER USER Alice SET QUERY_TAG = 'TeamD';
ALTER USER John SET QUERY_TAG = 'TeamE';

Set user-level query tag


USE ROLE ACCOUNTADMIN;

SELECT user_name, role_name, query_tag
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE query_tag = 'TeamA'
GROUP BY user_name, role_name, query_tag;

Selected specific query tags from query history


Session-Level Query Tags:
Other than account-level and user-level Snowflake query tags, Snowflake also allows for
session-level query tags. These unique Snowflake query tags are extremely useful for
categorizing and tracking queries related to specific tasks or issues. You can assign a session-
level query tag before running any important query, making it easier to locate them later if you
need to use Snowflake time travel.
For example, you could create a session-level query tag named "HighPriority" before initiating
object tagging in your account in order to group and reuse associated queries.
To assign a session-level query tag in Snowflake, you can use the ALTER SESSION command
with the SET QUERY_TAG parameter.
Here's an example SQL query:
ALTER SESSION SET QUERY_TAG = 'HighPriority';

Set session-level query tag


USE ROLE ACCOUNTADMIN;

SELECT user_name, role_name, query_tag
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE query_tag = 'HighPriority';

Selected specific query tags from query history
Note: Session-level Snowflake query tags are unique to the current session and replace
user-level Snowflake query tags. So, by utilizing session-level Snowflake query tags,
you enhance your ability to efficiently identify queries related to specific tasks or
issues, which can be incredibly valuable when debugging queries or any issue.
What are the benefits of using Query Tags in Snowflake?
There are numerous benefits to using Snowflake query tags, some of which are listed below:

1) Snowflake Query Monitoring + Management 🕵️


Snowflake query tags make it easy to find and track queries based on criteria like department,
user, or purpose of the query, which makes them much easier to monitor and manage. This
info can be really valuable in detecting performance issues or optimizing resource allocation.
For instance, if a particular user or department is experiencing slow query times, Snowflake
query tags allow account admins to pinpoint the exact queries causing the problem and take
appropriate measures to improve overall Snowflake performance.

2) Optimizing Snowflake Query Performance 🚀


Snowflake query tags play a significant role in optimizing query performance. By tagging
queries, users can easily find and prioritize specific queries, allocating resources based on their
priority and needs for faster processing and reduced wait times. This makes the whole querying
process more efficient.

3) Streamlined Snowflake Query Tracking Process 🔍


Snowflake query tags are a versatile feature that simplifies query auditing by enabling
queries to be organized according to specific criteria for categorization and easy retrieval.
So by tagging queries with specific labels/criteria, admins can quickly search for and identify
relevant information, which can help streamline the query auditing process and improve
compliance and security measures as well.
4 best practices for Using Query Tags in Snowflake
1) Establishing a consistent query tagging technique
It's ABSOLUTELY necessary to have a consistent tagging approach that aligns with your
business goals and query management processes. Creating a clear naming convention for your
tags is a good idea to make analysis and reporting easier and more useful.
2) Limiting the number of Snowflake query tags
When creating Snowflake query tags, it's crucial to strike a perfect balance between
comprehensiveness and manageability. Even though it's important to have enough tags to cover
the most important parts of query activity, having too many tags can make query data too big,
confusing, and hard to analyze. So, it's best to come up with a small set of query tags that cover
the most important parts of query activity.
3) Using Snowflake query tags to track essential metrics at all times
Snowflake query tags play an important role in monitoring and optimizing query performance.
You can put these descriptive labels on queries to keep track of key metrics like resource usage,
run time, and queue time by group. By analyzing these trends, you can easily identify query
groups that need optimization. For instance, to find queries that use too much CPU or take too
long to run, you can use tags to find out which users or applications run them. With this
valuable insight, you can better understand where and how your resources are being utilized and
make informed decisions about Snowflake query optimization.
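As a rough sketch of this idea, assuming query tags have already been set as shown earlier, you could aggregate ACCOUNT_USAGE.QUERY_HISTORY by tag to spot expensive query groups:
-- Average runtime and queue time per query tag over the last 7 days
SELECT query_tag,
       COUNT(*)                         AS query_count,
       AVG(total_elapsed_time) / 1000   AS avg_runtime_seconds,
       AVG(queued_overload_time) / 1000 AS avg_queue_seconds
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
  AND query_tag <> ''
GROUP BY query_tag
ORDER BY avg_runtime_seconds DESC;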
4) Using Snowflake query tags for compliance + auditing purposes
Snowflake query tags make tracking and monitoring queries for compliance and auditing
purposes easy and quick. You can use Snowflake query tags to find queries that access sensitive
data or do important tasks so that they can be thoroughly audited. These tags make it easier to
identify anomalies or potential issues requiring attention.
Conclusion
Query performance tracking can be daunting, particularly when dealing with millions of queries.
Snowflake query tags, fortunately, provide a solution to this problem by allowing users to label
their workloads. These labels aid in categorizing and aggregating queries for simple monitoring
and optimizing platform resources, resulting in better utilization and reduced costs.
In summary, Snowflake query tags enable users to enhance their Snowflake query management
workflow while also enhancing query performance and significantly reducing Snowflake costs.
Start tagging those queries and feel like a boss who can easily find a needle in a haystack, or in
this case, a query in your Snowflake account.

FAQs
How are query tags different from object tags?
Query tags and object tags serve different purposes in Snowflake. Query tags are used for
tagging SQL statements to enable better cost and performance monitoring. Object tags, on the
other hand, are used for persistent account objects like users, roles, tables, views, and functions.
Why should I use query tags in Snowflake?
Query tags provide enhanced cost attribution and performance monitoring. They allow for fine-
grained cost attribution by associating costs with specific queries or sets of related queries.
Can query tags be used with dbt (data build tool)?
Yes, query tags can be used with dbt.
Can query comments be used instead of query tags?
Yes, query comments can be used to add metadata to queries, but they have some limitations.
Query tags are simpler to parse and analyze downstream, while query comments may require
additional processing. Query comments also have a size limit of 1MB, whereas query tags are
limited to 2000 characters.
How do you set a query tag in Snowflake?
To set a query tag in Snowflake, you can use the ALTER ACCOUNT, ALTER USER, or ALTER
SESSION command with the SET QUERY_TAG parameter, depending on whether the tag should
apply at the account, user, or session level.
5 Tips for Snowflake query optimization using Snowflake Query Profile
Understand the Snowflake Operators
To increase the efficiency of Snowflake queries, it is important to understand the roles each
operator plays in the query pipeline and how different operators interact with one another. So, if
you understand certain patterns and look at the results of each step, you can find places to
improve to make the Snowflake query optimization process faster and more efficient.
Identify Performance Issues
Use the Query Profile to see which steps of the query pipeline are consuming the most time and
concentrate your Snowflake query optimization efforts there. After you've identified potential
performance bottlenecks, look into why some operators are taking longer than expected or using
too many resources. And once these issues have been identified, you can take corrective action
to increase query efficiency and minimize overall runtime.
Search for long-running stages
When analyzing the query profile, ALWAYS ALWAYS pay attention to stages that require
more processing time, as these are often the most likely candidates to experience slowdowns. So
by pinpointing the stages that need improvement, you can then focus your Snowflake query
optimization efforts on them and work towards improving overall Snowflake query performance.
"Query Details" to identify bottlenecks
"Query Details" section of the query profile provides additional information about the data
involved in the query, including the number of rows and bytes processed at each stage. Use these
stats to identify bottlenecks in the query. For instance, if you see that a particular stage is
processing a large number of rows, you may want to consider optimizing the query to reduce the
amount of data processed.

Query details dashboard


"SQL Text" to analyze the SQL code
"SQL Text" section of the query profile provides the actual SQL code that was executed. Make
use of this section to analyze the SQL code and identify potential Snowflake query optimization.
For instance, you may be able to optimize the query by rewriting it to use more efficient joins or
filters to achieve better results.

SQL text section


Conclusion
Snowflake query optimization is important for any business that wants to improve the
performance of their Snowflake queries, lower the costs of their data warehouses, and save
serious cash. With Snowflake's Query Profile feature, you'll get a detailed view of your queries'
performance and know exactly, with pinpoint accuracy, what needs to be optimized. This article
covered everything you need to know about the Snowflake Query Profile, along with the best
tips and tricks to get the most out of Snowflake and optimize its query performance.
So, if you want to make sure your queries are running at their best, give Snowflake Query
Profile a try today!

FAQs
What information does Snowflake Query Profile provide?
Snowflake Query Profile provides information such as query runtime stats, query execution
stats, query resource usage, and query error/warning messages.
How do you profile a query in Snowflake?
Snowflake Query Profile can be accessed through Snowsight or Classic UI by navigating to the
Activity or History sections and selecting the desired query. Also, you can use the
GET_QUERY_STATS function to retrieve query statistics programmatically.
How can I analyze a query using Snowflake Query Profile?
Snowflake Query Profile offers a Query Graph View and Query Stats Panel. The Query Graph
View visualizes the data pipeline flow, while the Query Stats Panel provides performance
metrics and information about each operator in the pipeline.
When should I use the Snowflake Query Profile?
The Query Profile should be used when you need more diagnostic information about a query,
such as understanding its performance or identifying bottlenecks.
What are some things to look for in the Snowflake Query Profile?
Some indicators of poor query performance in the Query Profile include high spillage to remote
disk, a large number of partitions scanned, exploding joins, Cartesian joins, operators blocked by
a single CTE, unnecessary early sorting, repeated computation of the same view, and a very
large Query Profile with numerous nodes.
What is the purpose of the Snowflake query profile?
The Snowflake query profile provides execution details for a query, offering a graphical
representation of the processing plan's components, along with statistics for each component and
the overall query.
What is an example of a Query_tag in Snowflake?
Query tags in Snowflake are session-level parameters that can have defaults set at the account
and user level. For example, you can set a default query tag like '{"DevTeam": "software",
"user": "John"}', and every query issued by that user will have this default tag.
How do you write an efficient Snowflake query?
To write efficient queries in Snowflake, you should focus on scan reduction (limiting the volume
of data read), query rewriting (reorganizing a query to reduce cost), and join optimization
(optimally executing joins).
How do you write efficient Snowflake queries?
Follow these best practices:
 Use Snowflake Query Profile tool.
 Choose the right-sized virtual warehouse.
 Maximize caching.
 Leverage materialized views.
 Optimize data clustering and micro-partitioning.
 Utilize Snowflake Query Acceleration.
 Consider using the Search Optimization Service.

Snowflake Storage Costs 101 - An In-Depth Guide to Manage and Optimize Storage Costs

In this article, we will provide valuable insights and techniques for reducing Snowflake storage
costs. We'll take a closer look at how Snowflake storage costs work—and point out specific key
areas where you can clean it up. This article is suitable for both new and experienced Snowflake
users, covering key areas for Snowflake cost optimization and best practices. By the end of this
article, you'll have a better understanding of Snowflake storage costs and the tools & techniques
needed for improving your Snowflake ROI.
How Do Snowflake Storage Costs Work?
Before we do a deep dive into the intricacies of understanding and optimizing Snowflake storage
costs, let's first understand its "multi-cluster, shared data" design architecture - separating
storage, compute, and cloud services. This design allows Snowflake to logically integrate these
components while physically separating them. It differs from traditional data warehouses, where
storage, compute, and cloud services are tightly coupled. Let's delve further into the design of
a multi-cluster, shared data system, focusing on its three components:
 Storage: This is the persistent storage layer for data stored in Snowflake. It resides in a
scalable cloud storage service (such as GCS or S3) that ensures data replication, scaling,
and availability without customer management. Snowflake optimizes and stores data in a
columnar format within this storage layer, organized into databases as specified by the
user.
 Compute: This is a collection of independent compute resources that execute data
processing tasks required for queries.
 Cloud Services: This is a collection of system services that handle infrastructure,
security, metadata, and optimization across the entire Snowflake platform.

Snowflake Architecture (Source: docs.snowflake.com)


Snowflake storage costs are an important factor to consider when using it. The monthly costs for
storing data in Snowflake are based on a flat rate per terabyte (TB) of consumption (after being
compressed). The amount charged is determined by your account type (capacity or on demand),
Cloud Platform, and the region, whether in the United States, Asia, or Europe.
For example, the Capacity storage rate is ~$23 per terabyte per month for customers located in
the United States and ~$24.50 per terabyte per month for customers located in the European
Union region. The cost of Snowflake Credits for capacity purchases is determined upon ordering
and is based on the size of the total committed customer purchase. On the other hand, if
you decide to use Snowflake's On Demand pricing, the storage rate for AWS – US East
(Northern Virginia) customers is $40 per terabyte per month, whereas it is $45 per
terabyte per month for customers who deploy in the EU region.
Note: Snowflake storage costs can vary depending on the type of data you are storing
and the duration for which you keep it.
There are several types of storage costs in Snowflake, such as:
Staged file storage cost in Snowflake: It refers to the costs associated with storing files for bulk
data loading or unloading in Snowflake. In Snowflake, data loading and unloading are typically
done by staging the data in a file format, such as CSV or JSON, before loading it into or
unloading from the Snowflake database. When a file is staged, Snowflake calculates storage costs
based on the size of the file, so the cost fluctuates with the volume of staged data.
Snowflake does not charge data ingress fees to bring data into your account but does charge for
data egress [5].
Note: Data ingress is free of charge; however, utilizing Snowpipe or executing
COPY or INSERT queries to load data will incur a compute cost.
Take a look at this guide on loading data in Snowflake.
Database storage cost in Snowflake: It refers to the costs associated with storing data in
Snowflake's databases. It includes the data stored in the database tables and any historical data
maintained for Time travel. So, how does time travel work in Snowflake? Time travel is a
feature in Snowflake that allows users to query their data as it existed at a specific point in time.
This feature is used to recover lost data, track changes in data over time, or for compliance and
auditing purposes. Historical data stored for Time Travel is also included in the database costs.
Snowflake automatically compresses all data stored in tables, which helps optimize Snowflake
storage used for an account. The compressed file size is used to calculate the total storage used
for an account, and this is used to determine the costs associated with the data stored in the
databases. The compressed file size is smaller than the original file size, so the cost is
correspondingly lower.
Time Travel and Fail-safe Costs: These refer to the costs associated with maintaining historical
data and ensuring data integrity in Snowflake. These costs are calculated per day based on the
amount of data that is changed during that time period.
Time Travel Storage Costs in Snowflake:
 Snowflake’s Time Travel feature allows users to query their data as it existed at a
specific point in time, and it also provides Snowflake query optimization and query cost
reduction benefits.
 The number of days historical data is maintained is based on the table type and the Time
Travel retention period for the table.
 Snowflake minimizes the amount of storage required for historical data by maintaining
only the necessary information to restore the individual table rows that were updated or
deleted.
Fail-safe Storage Costs in Snowflake:
 Snowflake’s Fail-safe feature allows data to be recovered even after the Time Travel
period has ended. It is meant to be used in extreme cases, and it can only be accessed by
Snowflake.
 The costs associated with Fail-safe are calculated based on the amount of data that is
stored in multiple copies.
 Snowflake only maintains full copies of tables when tables are dropped or truncated.
Zero-copy cloning cost: It refers to the costs associated with creating a copy of a database,
schema, or table along with its associated data in Snowflake without incurring any additional
storage charges until changes are made to the cloned object. It is because zero-copy cloning is a
metadata-only operation [4]. For example, storage charges apply from the point at which a user
clones a database and then adds a new table or deletes rows from the cloned table. Zero-copy
cloning has many uses beyond creating backups, such as supporting the
development and test environments.
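For instance, here is a minimal sketch of cloning for a dev environment (the database and table names are hypothetical); storage is only billed once the clone diverges from its source:
-- Clone an entire database for development; no extra storage is consumed yet
CREATE DATABASE dev_db CLONE prod_db;

-- Clone a single table as of a Time Travel offset (1 hour ago)
CREATE TABLE orders_backup CLONE orders AT (OFFSET => -3600);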
Monitoring Snowflake Storage Costs Using Snowsight
Monitoring data storage is an important aspect of Snowflake cost reduction and ensuring that
your data is stored efficiently. As an ACCOUNTADMIN, you have the ability to view data
storage across your entire account as well as for individual tables [1].
Account-wide Storage:
 As an admin, you can use Snowsight or even the classic web interface to view data
storage across your entire account. But for the sake of this article, we will be using
Snowsight to do it.
 In Snowsight, navigate to Admin > Usage > Storage.
 This page displays the total average Snowflake data storage for your account and all
databases, stages, and data in fail-safe.

Monitoring Account-wide storage usage


Individual Table Storage:
 Any user with the appropriate privileges can view the data storage for individual tables.
 In the Classic Web UI or Snowsight, navigate to Databases > “db_name” > Tables.
 This will show the storage usage for each table in the selected database.

Individual Table storage


Note: Keep an eye on how much data storage you're using so you're aware of any
unexpected changes. This will help you find and fix any problems that might be
affecting the efficiency of your data storage and help you get the most out of your
storage costs.
To gain insight into historical Snowflake storage costs, users can access
the ACCOUNT_USAGE and ORGANIZATION_USAGE schemas. These schemas provide
detailed information about storage costs, including the amount of data stored, the duration of
time the data was stored, and the cost associated with storing the data.
Viewing Snowflake Storage History for Your Organization
As we have previously discussed, the total Snowflake storage cost is comprised of various costs
associated with different types of storage, including staged file storage, database table storage,
and fail-safe and time-travel storage. To better understand Snowflake’s storage costs for your
organization, users with the ACCOUNTADMIN role can use Snowsight to view the amount of
data stored in Snowflake.
To access this storage info, you must navigate to the Admin section and select Usage from the
drop-down menu.

Admin section and Usage dropdown menu


From there, select Storage from the Usage Type drop-down.
Usage section and storage dropdown menu
It is essential to select a virtual warehouse when using the dashboard to view cost and usage
information from the shared SNOWFLAKE database, as retrieving this data consumes compute
resources. Snowflake recommends using an X-Small (extra small) warehouse for this
purpose.

Warehouse selection
Warehouse size
Filter by Tag
Usage dashboard in your organization allows you to filter storage usage by a specific tag/value
combination. This feature is designed to enable you to attribute Snowflake costs to specific
logical units within your organization. Using tags, you can easily identify and track the storage
usage for different departments, projects, or any other logical units you have defined in your
organization. This feature is similar to filtering credit consumption by tag, enabling you to track
and attribute costs to specific units within your organization. In both cases, using tags allows for
easy tracking and cost attribution, making it easy to understand and manage your organization's
storage and credit usage.

Filter by Tag
View Snowflake Storage by Type or Object
The bar graph on the Usage dashboard of your organization allows you to filter Snowflake
storage usage data either By type or By object.

Filter storage usage data


Filtering By Type, you can view the size of storage for each Snowflake storage category, such as
Database, Fail Safe, and Stage. Additionally, storage associated with Time Travel is included in
the Database category, providing a comprehensive view of storage usage by type[2]. This feature
is useful for understanding the breakdown of storage usage across different storage categories
and can help identify areas where Snowflake storage usage is particularly high or low.
Filtering By Object, the graph displays the size of storage for each specific object. For example,
you can view the size of a particular database or stage [2]. This feature allows you to drill down
and see the detailed storage usage for specific objects, which can be useful for identifying and
addressing storage usage issues at the individual object level.
Viewing Data Usage for a Table in Snowflake
Snowsight allows users with the appropriate access privileges to view the size (in bytes) of
individual tables within a schema or database. This feature is useful for understanding the
storage usage of specific tables and can help identify and address storage usage issues at the
table level.
To view the size of a table using Snowsight, follow these steps:
 Select Data > Databases from the main menu.
 On the left side of the Databases page, use the database object explorer to view the
database and schema.
 Expand the database and any schema in the database to view the tables within it.
 Click on any table to view its statistics, including its size.
Data section and Databases dropdown menu

Database object explorer


Tables in a database

Table size
Query for Table Size in Snowflake
It is possible to gain insights into tables, including their size, by writing SQL queries instead of
using the Snowsight web interface; this allows for more flexibility and automation
when working with large amounts of data [3]. A user with the proper access privileges can list
data about tables using the SHOW TABLES command.
For example, the following query will list all tables in a specific schema:
SHOW TABLES IN SCHEMA your_schema;
Querying Data for Table Size
Let's go a bit further. Now that we know what the SHOW TABLES command does, it's time to
dive deeper into actually calculating the size of an individual table. Users with
the ACCOUNTADMIN role can use SQL to view table size information by executing queries
against the TABLE_STORAGE_METRICS view (available in the ACCOUNT_USAGE schema
and in each database's INFORMATION_SCHEMA). This view shows table-level storage
utilization info, which can be used to compute the storage billing for every table in the account,
including tables that have been dropped but are still incurring storage fees.
TABLE_STORAGE_METRICS holds storage information about all tables in a particular account,
including the number of bytes currently stored, the number of bytes stored historically, and more.
As a quick example, here is a sample query that calculates and returns the total table size in GB
for the table DATASETS in the REAL_ESTATE_DATA_ATLAS database:
USE ROLE ACCOUNTADMIN;
SELECT TABLE_NAME,
TABLE_SCHEMA,
SUM(((active_bytes + time_travel_bytes + failsafe_bytes +
retained_for_clone_bytes)/1024)/1024)/1024 AS TOTAL_STORAGE_USAGE_IN_GB
FROM REAL_ESTATE_DATA_ATLAS.INFORMATION_SCHEMA.TABLE_STORAGE_METRICS
WHERE TABLE_NAME IN ('DATASETS')
GROUP BY 1,2
ORDER BY 1,2,3;
Note: To query the TABLE_STORAGE_METRICS view in ACCOUNT_USAGE, the user must
have the ACCOUNTADMIN role, and there might be a 1-2 hour delay in updating the
storage-related stats for active_bytes, time_travel_bytes, failsafe_bytes, and
retained_for_clone_bytes in this view.
Query for Snowflake Storage Costs: ORGANIZATION_USAGE and ACCOUNT_USAGE Schemas
So before we even start to query the data for Snowflake storage cost, we need to understand that
Snowflake provides two schemas that make it possible to query storage
cost: ORGANIZATION_USAGE and ACCOUNT_USAGE. These schemas contain data related to
usage and cost and provide very detailed and granular, analytics-ready usage data to build
custom reports or even dashboards.
ORGANIZATION_USAGE schema contains data about usage and costs at the organization
level, which includes data about storage usage and cost for all the databases and stages within an
organization in a shared database named SNOWFLAKE. Only the users with
the ORGADMIN or ACCOUNTADMIN role can access the ORGANIZATION_USAGE schema.
For more information on the ORGANIZATION_USAGE views, please refer to this.
ACCOUNT_USAGE schema contains data about usage and costs at the account level, which
includes data about storage usage and cost for all the tables within an account.
For more information on the ACCOUNT_USAGE views, please refer to this.
Most views in these two schemas contain the cost of storage in terms of storage size. But, if you
want to view the cost in currency instead of size, you can write queries using
the USAGE_IN_CURRENCY_DAILY view. This view converts the storage size into actual
currency using the daily price of a TB, allowing users to easily understand the actual cost of
storage in terms of currency.
SELECT
DATE_TRUNC('day', USAGE_DATE) AS "Usage Date",
SUM(USAGE) AS "Total Usage",
SUM(USAGE_IN_CURRENCY) AS "Total Cost"
FROM
ORGANIZATION_USAGE.USAGE_IN_CURRENCY_DAILY
WHERE
ORGANIZATION_NAME = 'YourOrganizationName'
AND USAGE_TYPE = 'YourUsageType'
AND USAGE_DATE BETWEEN 'start_date' AND 'end_date'
AND CURRENCY = 'CurrencyOfTheUsage'
GROUP BY
1
ORDER BY
1 DESC
LIMIT
10;
This particular query returns the total usage and total cost for a specific organization, usage type,
and date range, grouped by day and ordered by decreasing usage date, with a limit of 10 rows.
Snowflake Storage Cost Reduction Techniques for Maximizing Snowflake ROI
Let's talk about some ways to keep Snowflake storage costs as low as possible. We've already
talked about how storage, compute, and services are the three main things that make up
Snowflake costs. Snowflake's pricing is based on how much you use it, so you only pay for what
you use. It's very important to keep an eye on certain areas to reduce storage costs, as you
may be BURNING up space unnecessarily. Here are some key places to check for ways to clean
up and cut down on storage costs (a short SQL sketch follows the list).
1. Staged Files: Staged files used for bulk data loading and unloading consume storage space.
Frequently check the STORAGE_USAGE view in ACCOUNT_USAGE, which gives an idea
of how much space is consumed by stages.
2. Unused Tables: Use Snowflake's ACCESS_HISTORY object to find how frequently the
objects are queried and drop those that haven't been used in a certain period of time.
3. Time Travel Costs: Time Travel is a unique feature of Snowflake that retains deleted or
updated data for a specified number of days, but it is important to note that this stored
historical data incurs the same cost as active data. Hence, always be careful to check it, and
consider the data criticality and table churn rate when setting time-travel retention
periods.
4. Transient Tables: Consider having a transient database/schema for environments where
data has low importance, such as staging environments. This will help control storage
costs and avoid unnecessary data retention.
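Here is the promised sketch; the object names are hypothetical and the statements simply illustrate the techniques listed above:
-- 1. Check how much space database storage, stages, and Fail-safe are consuming
SELECT usage_date,
       storage_bytes / POWER(1024, 4)  AS database_tb,
       stage_bytes / POWER(1024, 4)    AS stage_tb,
       failsafe_bytes / POWER(1024, 4) AS failsafe_tb
FROM SNOWFLAKE.ACCOUNT_USAGE.STORAGE_USAGE
ORDER BY usage_date DESC
LIMIT 30;

-- 3. Shorten the Time Travel retention period for a low-criticality table
ALTER TABLE staging_db.public.raw_events SET DATA_RETENTION_TIME_IN_DAYS = 1;

-- 4. Use a transient table for data that does not need Fail-safe protection
CREATE TRANSIENT TABLE staging_db.public.tmp_load LIKE staging_db.public.raw_events;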
Conclusion
As with any cloud-based service, you should ALWAYS ALWAYS keep a close eye on your
Snowflake storage costs to make sure you're not spending too much on resources you don't
actually need. By understanding the different storage costs and using tools and techniques like
Snowsight (Snowflake web user interface) and custom queries, users can get valuable
information about their storage costs. Using the tips and tricks outlined in this article, you can
keep your Snowflake storage costs to a minimum and improve your Snowflake ROI.

FAQs
How is storage cost calculated in Snowflake?
Snowflake storage costs are based on a flat rate per terabyte of consumption (after compression)
and vary depending on the account type (capacity or on demand), Cloud Platform, and region.
What is staged file storage cost in Snowflake?
Staged file storage cost refers to the cost associated with storing files for bulk data loading or
unloading in Snowflake.
How can I monitor Snowflake storage costs?
Snowflake provides tools like Snowsight to monitor storage usage at the account-wide and
individual table levels. Users can also query the ACCOUNT_USAGE and
ORGANIZATION_USAGE schemas to gain detailed information about storage costs.
How much does Snowflake storage cost for 1 TB?
The cost of storing 1 TB of data in Snowflake varies based on factors such as account type,
Cloud Platform, and region. For Capacity storage in the United States, it's around ~$23/month,
while in the EU, it's approximately ~$24.50/month. With On Demand Pricing, it's ~$40/month in
the US and ~$45/month in the EU.
Snowflake Data Transfer Costs 101 - An In-Depth Guide to Manage and Optimize Data Transfer
Costs

Snowflake, a popular cloud data platform, provides users with superior scalability, low latency,
advanced analytics and flexible pay-per-use pricing—but managing its data transfer costs can be
very daunting and hectic for businesses. To help get the most value out of Snowflake, it's
important to understand how Snowflake data transfer costs work, including both data ingress and
egress costs.
In this article, we'll explain Snowflake's data transfer costs, including data ingress and egress
fees, how they vary depending on cloud providers and regions, and how to optimize Snowflake
costs and maximize ROI.
Understanding Snowflake Data Transfer Costs
Snowflake calculates data transfer costs by considering criteria such as data size, transfer rate,
and the region or cloud provider from where the data is being transferred. Before tackling how
to reduce these costs, it is important to understand how Snowflake calculates them.
Snowflake Data Ingress and Egress Costs
Snowflake data transfer cost structure depends on two factors: ingress and egress.
Ingress, meaning transferring the data into Snowflake, is free of charge. Data egress, however,
or sending the data from Snowflake to another region or cloud platform incurs a certain fee. If
you transfer your data within the same region, it's usually free. When transferring externally
via external tables, functions, or Data Lake Exports, though, there are per-byte charges
associated that can vary among different cloud providers.
Take a look at some Snowflake pricing tables for comparison to get an idea of what the
difference in cost could look like:
Snowflake Data Transfer Costs in AWS:

Snowflake data transfer costs in AWS


Snowflake Data Transfer Costs in Microsoft Azure:

Snowflake data transfer costs in Microsoft Azure


Snowflake Data Transfer Costs in GCP

Snowflake data transfer costs in GCP


Snowflake Features That Trigger Data Transfer Costs
Transferring data from a Snowflake account to a different region within the same cloud platform
or to a different cloud platform using certain Snowflake features comes with Snowflake data
transfer costs. These costs may be triggered by several activities, such as:
1) Unloading Data in Snowflake
Unloading data in Snowflake is the process of extracting information from a database and
exporting it to an external file or storage system such as Amazon S3, Azure Blob or Google
Cloud Storage. This allows users to make the extracted data available for use in other
applications, whether for analytics, reporting, or visualizations. Unloading can be done manually
or automatically; however, it's important to consider factors such as data backup frequency, data
volume, storage, and transfer costs first.
To unload data from Snowflake to another cloud storage provider, the COPY INTO
<location> command is used. This requires setting up a stage, which has certain costs associated
with it. It's possible to unload the data to a location different from where the Snowflake account
is hosted. The complete syntax for the command is as follows:
COPY INTO { internalStage | externalStage | externalLocation }
FROM { [<namespace>.]<table_name> | ( <query> ) }
[ PARTITION BY <expr> ]
[ FILE_FORMAT = ( { FORMAT_NAME = '[<namespace>.]<file_format_name>' |
TYPE = { CSV | JSON | PARQUET }
[ formatTypeOptions ] } ) ]
[ copyOptions ]
[ VALIDATION_MODE = RETURN_ROWS ]
[ HEADER ]
Let’s chunk down this syntax line by line to understand it better.
To indicate where the data will be unloaded, use the COPY INTO command followed by the
stage location, which can be internal or external. For instance:
Internal or external stage: COPY INTO @your-stage/your_data.csv

Google Cloud Bucket: COPY INTO 'gcs://bucket-name/your_folder/'

Amazon S3: COPY INTO 's3://bucket-name/your_folder/'

Azure: COPY INTO 'azure://account.blob.core.windows.net/bucket-name/your_folder/'
The FROM clause specifies the source of the data, which can be a table or a query. For example:
Copy entire table: FROM SNOWFLAKE_SAMPLE_DATA.USERS

Copy using query: FROM (SELECT USER_ID, USER_FIRST_NAME, USER_LAST_NAME,


USER_EMAIL FROM SNOWFLAKE_SAMPLE_DATA.USERS)
When unloading to an external storage location, you need to specify
the STORAGE_INTEGRATION parameter, which contains authorization for Snowflake to write
data to the external location.
The PARTITION BY parameter splits the data into separate files based on a string input, while
the FILE_FORMAT parameter sets the file type and compression. The copyOptions parameter
includes optional parameters that can be used to further customize the results of the COPY
INTO command.
To ensure data consistency, consider creating a named file format where you can define file
type, compression, and other parameters upfront. The VALIDATION_MODE parameter is used
to return the rows instead of writing to your destination, allowing you to confirm that you have
the correct data before unloading it.
Lastly, the HEADER parameter specifies whether you want the data to include headers.
Example query:
COPY INTO 's3://bucket-name/your_folder/'
FROM (SELECT USER_ID, USER_FIRST_NAME, USER_LAST_NAME, USER_EMAIL FROM
SNOWFLAKE_SAMPLE_DATA.USERS)
STORAGE_INTEGRATION = YOUR_INTEGRATION
PARTITION BY LEFT(USER_LAST_NAME, 1)
FILE_FORMAT = (TYPE = 'CSV' COMPRESSION = 'AUTO')
DETAILED_OUTPUT = TRUE
HEADER = TRUE;
Note: Always ensure that the data has been unloaded successfully at the destination
and take appropriate measures to prevent any potential data leaks.
2) Replicating Data in Snowflake
Replicating data is the process of creating a copy of data and storing it in a separate location.
This process can be highly valuable for disaster recovery or for providing read-only access for
reporting purposes. There are different methods of database replication that can be used, such as
taking an up-to-date data snapshot of the primary database and copying it to the secondary
database.
For Snowflake accounts, replication can involve copying data to another Snowflake account
hosted on a different platform or region than the origin account. This approach helps ensure data
availability, even in case of an outage or other issues with the primary account. Also, replicating
data to different locations can bring benefits like lower latency and improved performance for
users accessing data from various regions worldwide.
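As a rough sketch (the account and database names are hypothetical), cross-account database replication in Snowflake typically looks like this:
-- On the source account: allow the database to be replicated to another account
ALTER DATABASE sales_db ENABLE REPLICATION TO ACCOUNTS myorg.target_account;

-- On the target account: create a secondary (read-only) replica of the database
CREATE DATABASE sales_db AS REPLICA OF myorg.source_account.sales_db;

-- Periodically refresh the replica; this transfer is what incurs data transfer costs
ALTER DATABASE sales_db REFRESH;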
3) Writing external functions in Snowflake
External functions in Snowflake refer to user-defined functions that are not stored within the
Snowflake and are instead executed outside of it. This feature allows users to easily access
external application programming interfaces (APIs) and services, including geocoders, machine
learning models, and custom code that may be running outside the Snowflake environment.
External functions eliminate the need to export and re-import data when accessing third-party
services, significantly simplifying data pipelines. This feature streamlines the data flow by
allowing Snowflake to access data from external sources and then process it within the
Snowflake environment. External functions can also enhance Snowflake's capabilities by
integrating with external services, such as machine learning models or other complex
algorithms, to perform more advanced data analysis. This integration is made possible by
allowing Snowflake to send data to an external service, receive the results, and then process
them within the Snowflake environment.
We will delve deeper into this topic in our upcoming article, but in the meantime, you can refer
to this documentation to know more about it.
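In the meantime, here is a very rough sketch; the API integration name, endpoint URL, function signature, and table are all hypothetical, and an API integration object must already exist:
-- An API integration (created separately by an admin) authorizes Snowflake to call the endpoint
CREATE OR REPLACE EXTERNAL FUNCTION geocode_address(address VARCHAR)
  RETURNS VARIANT
  API_INTEGRATION = my_api_integration
  AS 'https://example.execute-api.us-east-1.amazonaws.com/prod/geocode';

-- Calling the function sends the input rows to the external service
SELECT address, geocode_address(address) AS coordinates
FROM customer_addresses;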
Delving into Snowflake Data Transfer Cost
By now, you may be aware that importing data into your Snowflake account is free of cost, but
there is a per-byte fee to transfer data across regions on the same cloud platform or to a different
cloud platform. To gain a more thorough understanding of historical data transfer costs, users
can make use of Snowsight, the Snowflake web interface, or execute queries against
the ACCOUNT_USAGE and ORGANIZATION_USAGE schemas. Snowsight provides a visual
dashboard for obtaining cost overviews quickly. If you require additional in-depth information,
you may run the queries against usage views to analyze cost data in greater depth and even
generate custom reports/dashboards.
How to access Snowflake Data Transfer Costs?
In Snowflake, only the account administrator (a user with the ACCOUNTADMIN role) has
default access to view cost and usage data in Snowsight, ACCOUNT_USAGE schema, and
the ORGANIZATION_USAGE schema. However, if you have the USERADMIN role or higher, you
can grant access to other users by assigning them SNOWFLAKE database roles.
The following SNOWFLAKE database roles can be used to access cost and usage data:
 USAGE_VIEWER: This role provides access to cost and usage data for a single account
in Snowsight and related views in the ACCOUNT_USAGE schema.
 ORGANIZATION_USAGE_VIEWER: If the current account is the ORGADMIN account,
this role provides access to cost and usage data for all accounts in Snowsight, along with
views in the ORGANIZATION_USAGE schema that are related to cost and usage but
not billing.
So, by default, only the account administrator has access to cost and usage data.
However, SNOWFLAKE database roles can be used to give other users who need to view and
analyze cost and usage data access to the data.
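For example, a minimal sketch of granting that access (the custom role name finops_analyst is hypothetical) might be:
USE ROLE ACCOUNTADMIN;

-- Allow a custom role to view cost and usage data for this account
GRANT DATABASE ROLE SNOWFLAKE.USAGE_VIEWER TO ROLE finops_analyst;

-- If run in the ORGADMIN account, allow organization-wide usage visibility
GRANT DATABASE ROLE SNOWFLAKE.ORGANIZATION_USAGE_VIEWER TO ROLE finops_analyst;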
How to view the overall Snowflake Data Transfer Cost using Snowsight?
As we have already mentioned above, only the account administrator (a user with
the ACCOUNTADMIN role) can use Snowsight to obtain an overview of the overall cost of
using Snowflake for any given day, week, or month.
Here are the steps to use Snowsight to explore the overall cost:
 Navigate to the "Admin" section and select "Usage"
Admin section and usage dropdown menu
 From the drop-down list, select "All Usage Types."

Usage dashboard
How to Query Data for Snowflake Data Transfer Cost?
Before we begin querying the data for Snowflake data transfer costs, we must first understand
that Snowflake has two schemas for querying data transfer
costs: ORGANIZATION_USAGE and ACCOUNT_USAGE. These schemas contain data related
to usage and cost and provide very detailed and granular, analytics-ready usage data to build
custom reports or even dashboards.
 ORGANIZATION_USAGE schema contains data about usage and costs at the
organization level, including storage usage and cost data for all organizational databases
and stages in a shared database named SNOWFLAKE. The ORGANIZATION_USAGE
schema is only accessible to users with the ORGADMIN or ACCOUNTADMIN roles.
 ACCOUNT_USAGE schema contains usage and cost data at the account level, including
storage consumption and cost for all tables inside an account.
Most views in the ORGANIZATION_USAGE and ACCOUNT_USAGE schemas show how
much it costs to send data based on how much data is transferred. To view cost in currency
rather than volume, write queries against the USAGE_IN_CURRENCY_DAILY View. This view
transforms the amount of data transferred into a currency cost based on the daily price of
transferring a TB.
The table below lists the views that contain usage and data transfer costs information from your
Snowflake account to another region or cloud provider. These views can be used to gain insight
into the costs associated with data transfer.
ORGANIZATION_USAGE.DATA_TRANSFER_DAILY_HISTORY: Provides the number of
bytes transferred on a given day.

ORGANIZATION_USAGE.DATA_TRANSFER_HISTORY: Provides the number of bytes
transferred, along with the source and target cloud and region, and the type of transfer.

ACCOUNT_USAGE.DATABASE_REPLICATION_USAGE_HISTORY: Provides the number of
bytes transferred and credits consumed during database replication.

ORGANIZATION_USAGE.REPLICATION_USAGE_HISTORY: Provides the number of bytes
transferred and credits consumed during database replication.

ORGANIZATION_USAGE.REPLICATION_GROUP_USAGE_HISTORY: Provides the number of
bytes transferred and credits consumed during replication for a specific replication group.

ORGANIZATION_USAGE.USAGE_IN_CURRENCY_DAILY: Provides daily data transfer in TB,
along with the cost of that usage in the organization's currency.
The list above shows the Snowflake views related to data transfer usage and cost, along with
their descriptions and schemas. With these views, you can get more granular details about how
much data your Snowflake account transfers and what it costs.
5 Best Practices for Optimizing Data Transfer Costs
Searching for strategies to reduce Snowflake data transfer costs? Knowing how the platform
calculates such costs is a good start. But there are also best practices that should be kept in mind
if you want to save massively on Snowflake data transfer costs, such as:
1) Understanding Snowflake Data Usage Patterns
Understanding your data usage patterns is one of the most effective ways to optimize Snowflake
data transfer costs. By carefully analyzing how you access and use data, you can identify points
where reducing transfers will not reduce the data quality. This can help to bring down transfer
costs significantly.
For example, if you discover that certain datasets are not being accessed as often, you can
reduce the frequency of their data transfers, thereby lowering transfer costs while preserving
data integrity. So by utilizing this method, you can efficiently cut down on data transfer costs
while preserving the availability of your data.
2) Reducing Unnecessary Snowflake Data Transfer
Reducing unnecessary data transfers is another way to optimize Snowflake data transfer costs.
This can be achieved by minimizing data duplication and consolidating/merging datasets where
possible. Doing so will help reduce the amount of data you send, keeping your transfer costs
low, thus cutting down overall costs.
For example, if you have multiple copies of the same dataset stored in separate places, you can
merge those copies to save on data transfer costs. Similarly, if you have datasets that are
duplicated across multiple accounts, you can merge those accounts to reduce data transfer costs.
3) Using Snowflake Data Compression and Encryption Features
Snowflake offers built-in data compression and encryption features that can help you reduce data
transfer costs. Data compression helps minimize the amount of data that needs to be transferred,
while encryption ensures that your data is secure during transit. Encryption can also help you
comply with industry regulations and prevent potential data breaches. By leveraging these cost-
saving features, users can make sure that their data is transferred safely and securely without
breaking the bank.
4) Optimizing your Snowflake data transfer using Snowpipe
Snowpipe is a Snowflake feature that can help you significantly reduce data transfer costs. This
one-of-a-kind loading process seamlessly loads streaming data into your Snowflake account,
removing the need to preprocess the data before transferring it. Snowpipe also makes use of
Snowflake's auto-scaling capability to optimally allocate compute resources as needed to process
incoming data – helping you maximize efficiency while still cutting costs.
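A minimal Snowpipe sketch (the pipe, stage, and table names are hypothetical, and auto-ingest assumes cloud event notifications are configured on the stage) looks like this:
-- Continuously load new files landing in an external stage into a target table
CREATE OR REPLACE PIPE raw_events_pipe
  AUTO_INGEST = TRUE
AS
  COPY INTO raw_events
  FROM @my_external_stage/events/
  FILE_FORMAT = (TYPE = 'JSON');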
5) Monitoring your data transfer costs and usage
Keeping a close eye on and monitoring your Snowflake data transfer costs is essential for
optimizing performance and managing your budget. Snowflake gives you access to
comprehensive reports on data transfers, indicating the amount of data transferred, where it
moved from and to, and associated costs. This detailed information can be used to control your
costs and change how you move data around.
Conclusion
Knowing how Snowflake data transfer costs work is key to a successful strategy for keeping
them as low as possible. In this guide, we have provided an
overview of what Snowflake data transfer costs are, how they are calculated, and tips on how to
minimize them through proactive measures. Armed with these practices, you can make
effective decisions for cost optimization and keep your Snowflake data transfer costs LOW!

FAQs
What factors influence the cost of Snowflake data transfer?
Snowflake data transfer cost structure depends on two factors: ingress and egress.
Is data ingress free in Snowflake?
Yes, transferring data into Snowflake (data ingress) is generally free of charge.
Does Snowflake charge for data egress?
Yes, there are fees associated with data egress in Snowflake. Sending data from Snowflake to
another region or cloud platform incurs a certain fee.
Will I incur data transfer costs if I move data within the same region?
No, there are no data transfer costs when moving data within the same region in Snowflake.
Can I dispute an unexpected spike in charges?
Yes, you can contact Snowflake support to review and dispute any unexpected data transfer
charges on your Snowflake bill.
Snowflake Compute Costs 101 - An In-Depth Guide to Manage & Optimize Compute Costs

Snowflake, the leading cloud data platform widely used by businesses, is celebrated for its
unique pricing model and innovative architecture. Users are only charged for the compute
resources, storage, and data transfer they use, enabling them to save significantly on Snowflake costs.
However, this could lead to cost escalations if usage is not monitored carefully. To avoid such
issues and get maximum effectiveness out of Snowflake pricing, it is crucial to understand
Snowflake compute costs—the cost of using virtual warehouses, serverless computing and other
cloud services in Snowflake. These costs depend on the amount of Snowflake compute
resources used and the duration of their use; proper optimization can help businesses cut down
on their spending without sacrificing Snowflake performance. This article provides
actionable tips and real-world examples to help you better understand Snowflake compute costs
and maximize your Snowflake ROI.
Understanding Snowflake Compute Costs
Snowflake compute costs are easy to comprehend, but before we even dive into that topic, we
need to get a decent understanding of what a Snowflake credit means.
Snowflake Credit Pricing
Snowflake credits are the units of measure used to pay for the consumption of resources within
the Snowflake platform. Essentially, they represent the cost of using the various compute
options available within Snowflake, such as Virtual Warehouse Compute, Serverless Compute,
and Cloud Services Compute. If you are interested in learning more about Snowflake pricing and
Snowflake credits pricing, you can check out this article for a more in-depth explanation.
Snowflake Virtual Warehouse Compute
Virtual warehouse in Snowflake is a cluster of computing resources used to execute queries, load
data, and perform DML (Data Manipulation Language) operations. Unlike traditional on-premises
databases, which rely on a fixed MPP (Massively Parallel Processing) server, Snowflake's virtual
warehouse is a dynamic group of virtual database servers consisting of CPU cores, memory, and
SSD that are maintained in a hardware pool and can be deployed quickly and effortlessly [1].
Snowflake credits are used to measure the processing time consumed by each individual virtual
warehouse based on the number of warehouses used, their runtime, and their actual size.
There are mainly two types of virtual warehouses:
1. Standard Virtual Warehouse
2. Snowpark-optimized Virtual Warehouse
Standard Virtual Warehouse : Standard warehouses in Snowflake are ideal for most data
analytics applications, providing the necessary Snowflake compute resources to run efficiently.
The key benefit of using standard warehouses is their flexibility: they can be started and
stopped at any moment as per your needs, without any restrictions. Also, standard warehouses
can be resized on the fly, even while they are running, to adjust to the varying resource
requirements of your data analytics applications, meaning that you can dynamically allocate
more or fewer compute resources to your virtual warehouse depending on the type of
operations it is currently performing. In our upcoming article, we'll go into detail about how a
virtual warehouse works, including what Snowflake pricing is, how much a virtual warehouse
costs, and what its benefits and best practices are.
The various sizes available for Standard virtual warehouse are listed below, indicating the
amount of compute resources per cluster. As you move up the size ladder, the hourly credit cost
for running the warehouse gets doubled.
Snowflake Warehouse Size (Source: docs.snowflake.com)
Snowpark-optimized Virtual Warehouse : Snowpark-optimized Virtual warehouse provides 16x
memory per node compared to a standard Snowflake virtual warehouse and is recommended for
processes with high memory requirements, such as machine learning training datasets on a single
virtual warehouse node[3].
Creating a Snowpark-optimized virtual warehouse in Snowflake is just as straightforward as
creating a standard virtual warehouse. You can do this by writing or copying the SQL script
provided below. The script creates a SNOWPARK_OPT_MEDIUM virtual warehouse of medium
size, "SNOWPARK-OPTIMIZED" warehouse type, which is optimized for Snowpark operations.
create or replace warehouse SNOWPARK_OPT_MEDIUM with
warehouse_size = 'MEDIUM'
warehouse_type = 'SNOWPARK-OPTIMIZED';

Create Snowpark-optimized virtual warehouse


The table below lists the various sizes available for a Snowpark-optimized Virtual Warehouse,
indicating the amount of compute resources per cluster. The number of credits charged per hour
that the warehouse runs doubles as you progress from one size to the next (note that
Snowpark-optimized warehouses are not available in the X-Small and Small sizes).

Snowpark-optimized Warehouses Billing (Source: docs.snowflake.com)


Note: The larger warehouse sizes, 5X-Large and 6X-Large, are available for all
Amazon Web Services (AWS) regions. These sizes are currently in preview mode in
specific regions, including US Government regions and Azure regions.
Warehouses accrue charges only while they are running; a suspended warehouse stops accruing
credit usage. Credits are billed per second, rounded to the nearest thousandth of a credit, with a
minimum charge of 60 seconds, which means that even if a process only runs for a few seconds,
it will be billed as 60 seconds (a worked example follows the note below).
Note: Once the first minute(60 second) has passed, all subsequent billing is based on
the per-second rate until the warehouse is shut down[4]. If a warehouse is suspended
and then resumed within the first minute, multiple charges will occur as the 1-minute
minimum will start over each time the warehouse is resumed. Also, resizing a
warehouse from 5X- or 6X-Large to a smaller size, such as 4X-Large, will result in a
brief period during which the warehouse is charged for both the new and old resources
while the old resources get paused.
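To make the math concrete, here is a worked example, assuming the standard rate of 4 credits per hour for a Medium warehouse:
 A Medium warehouse that runs for only 10 seconds is still billed for the 60-second minimum: 60 × (4 / 3600) ≈ 0.067 credits.
 The same warehouse running for 90 seconds is billed per second: 90 × (4 / 3600) = 0.1 credits.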
Snowflake Serverless Compute
Serverless compute in Snowflake refers to the compute resources managed by Snowflake rather
than standard virtual warehouse processes. Serverless credit usage refers to utilizing these
compute resources provided and managed by Snowflake[2]. Snowflake automatically adjusts the
serverless compute resources based on the needs of each workload.
The following is a list of available Snowflake Serverless features:
 Automatic Clustering : Automatic Clustering is a Snowflake serverless feature that
efficiently manages the process of reclustering tables as and when required, without any
intervention from the user.
 External Tables : A typical table stores its data directly in the database, while an external
table is designed to keep its data in files on an external stage. External tables contain
file-level information about the data files, such as the file name, version identifier—and
other properties. This makes it possible for users to query data stored in external staging
files as if it were kept within the database.
Note: External tables are read-only.
 Materialized views : A materialized view is a pre-computed data set based on a specific
query and stored for future use. The main benefit of using a materialized view is
improved performance because the data has already been computed, making querying the
view faster than executing the query against the base table. This performance
improvement can be substantial, particularly when the query is fired frequently.
 Query Acceleration Service : A query acceleration service is a feature that can enhance
the performance of a warehouse by optimizing the processing of certain parts of the
query workload. It minimizes the impact of irregular queries, which consume more
resources than the average query, thus improving warehouse performance.
 Search Optimization Service: Search Optimization Service has the potential to make a
big difference in the speed of queries that look up small subsets of the data using
equality or substring conditions.
 Snowpipe : Snowflake's Snowpipe is a Continuous Data Ingestion service that
automatically loads data as soon as it is available in a stage[5]. Snowpipe's serverless
compute model helps users initiate data loading of any size without managing a virtual
warehouse. Snowflake provides and manages the necessary compute resources,
dynamically adjusting the capacity based on the current Snowpipe load.
 Serverless tasks: Serverless tasks in Snowflake are designed to execute various types of
SQL code, including single SQL statements, calls to stored procedures, and procedural
logic. The versatility of Snowflake serverless tasks makes them suitable for a wide range of
scenarios. Serverless tasks can also be used independently to generate periodic reports
by inserting or merging rows into a report table or performing other recurring tasks.
Serverless feature charges are calculated based on Snowflake-managed compute resources' total
usage, measured in compute hours. Compute-hours are calculated per second, rounded up to the
nearest whole second. The amount of credit consumed per compute hour varies based on the
specific serverless feature.
Let's dig deep into how Snowflake pricing for the above-mentioned serverless features works.
Feature                         Snowflake Credits per Compute-Hour
Automatic Clustering            2
External tables                 2
Materialized views              10
Query Acceleration Service      1
Replication                     2
Search Optimization Service     10
Snowpipe                        1.25
Serverless tasks                1.5


These serverless features provide numerous advantages, ranging from better performance to
more streamlined data operations.
Note: Credits are consumed based on the total usage of a particular feature. For
example, automatic clustering and replication consume 2 credits per compute hour,
materialized views and search optimization service consume 10 credits per compute
hour, query acceleration service consumes 1 credit per compute hour, Snowpipe
consumes 1.25 credits per compute hour, and serverless tasks consume 1.5 credits per
compute hour. So, before deciding to use certain serverless features, you should
always be careful to think about how much credit these services will cost.
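To see how much each of these services is actually contributing to your bill, a minimal sketch along the following lines can help, assuming access to the SNOWFLAKE.ACCOUNT_USAGE.METERING_HISTORY view and its standard SERVICE_TYPE and CREDITS_USED columns (serverless features show up alongside warehouse metering):
-- Total credits consumed per service type over the past month
select service_type
      ,sum(credits_used) as credits_used
from snowflake.account_usage.metering_history
where start_time >= dateadd(month, -1, current_timestamp())
group by 1
order by 2 desc;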
Snowflake Cloud Services Compute
Snowflake Cloud services layer is a combination of services that manage various tasks within
Snowflake, including user authentication, security, query optimization, request caching—and a
whole lot more.
This layer integrates all components of Snowflake, including the use of virtual warehouses. It is
designed with stateless computing resources that run across multiple availability zones, utilizing
a highly available, globally managed metadata store for state management. The Cloud Services
Layer operates on compute instances supplied by the cloud provider, and Snowflake credits are
used to pay for usage.
Snowflake Cloud Services include the following features:
 User Authentication
 Query parsing and optimization
 Metadata management
 Access control
 Request query caching
How does Snowflake Pricing for Cloud Services work?
Like virtual warehouse usage, cloud service usage is paid for with Snowflake credits but with a
unique model. You only pay if your daily cloud service usage exceeds 10% of your total virtual
warehouse usage, calculated daily in the UTC time zone for accurate Snowflake credit pricing.
For example, if you use 120 Snowflake Warehouse credits daily, you get a 12-credit discount
(10% of 120 = 12). If you use 12 or fewer credits in Cloud Services, they are free of charge. If
you use 13 credits for Cloud Services, you'll pay for the 1 credit above 12. However, unused
discounts won't be transferred or adjusted with other compute usage.
Note: Serverless compute does not affect the 10% adjustment for cloud services. The
10% adjustment is calculated by multiplying the daily virtual warehouse usage by 10%
and is performed in the UTC zone.
Date     Compute Credits Used (Warehouses only)    Cloud Services Credits Used    Credit Adjustment for Cloud Services    Credits Billed
Jan 1    100                                       20                             -10                                     110
Jan 2    120                                       10                             -10                                     120
Jan 3    80                                        5                              -5                                      80
Jan 4    100                                       13                             -10                                     103
Total    400                                       48                             -35                                     413
(Credits Billed = Compute Credits + Cloud Services Credits + Adjustment)
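To produce a breakdown like the one above for your own account, a query along these lines should work, assuming access to the SNOWFLAKE.ACCOUNT_USAGE.METERING_DAILY_HISTORY view and its standard columns:
-- Daily compute credits, cloud services credits, adjustment, and billed total
select usage_date
      ,sum(credits_used_compute) as compute_credits
      ,sum(credits_used_cloud_services) as cloud_services_credits
      ,sum(credits_adjustment_cloud_services) as cloud_services_adjustment
      ,sum(credits_billed) as credits_billed
from snowflake.account_usage.metering_daily_history
where usage_date >= dateadd(day, -30, current_date())
group by usage_date
order by usage_date;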


Monitoring Snowflake Compute Costs
As we saw earlier, the total Snowflake compute cost is composed of the utilization
of various components. These include virtual warehouses, which are computing resources
controlled and managed by the user; serverless features such as Automatic Clustering and
Snowpipe, which utilize Snowflake-managed computing resources; and the cloud services layer
of the Snowflake architecture, which also contributes to the total Snowflake compute cost.
To better understand past Snowflake compute costs, one can utilize Snowsight, Snowflake's web
interface, or directly write SQL queries using
the ACCOUNT_USAGE and ORGANIZATION_USAGE schemas. Snowsight provides a
convenient and quick way to access cost information through its visual dashboard.
Viewing Snowflake Credit Usage
In Snowflake, all compute resources consume Snowflake credits, including virtual warehouses,
serverless features, and cloud services. An account admin can easily monitor the total Snowflake
compute credit usage through Snowsight, which provides a simple and easy-to-use interface for
monitoring the Snowflake pricing.
To explore Snowflake compute costs using Snowsight, the administrator (ACCOUNTADMIN)
can navigate to the Usage section under the Admin tab.

Admin section and Usage dropdown


Once you do that, select Compute from the All Usage Types drop-down menu to view the
detailed Snowflake compute usage information.
All-usage-types dropdown
Snowflake Credit Monitoring: Filter By Tag
Filter by tag is a really useful feature in Snowflake that allows you to associate the cost of using
resources with a particular unit. A tag is an object in Snowflake that can be assigned one or more
values. Users with the appropriate privileges can apply these tags and values to resources used by
cost centers or other logical units, such as a development environment or business unit, allowing
them to isolate costs based on specific tag and value combinations.
To view the costs associated with a specific tag and value combination in Snowsight, users can
open the Usage dashboard and select the desired tag from the Tags drop-down menu. And then,
select the desired value from the list of values associated with the tag and apply the filter by
selecting "Apply."

Filter By Tag
Snowflake Credit Consumption by Type, Service, or Resource
When analyzing the compute history through the bar graph display, you can filter the data based
on three criteria:
 By Type
 By Service
 By Resource.
Each filter provides a unique perspective on resource consumption and helps you get a more in-
depth understanding of the Snowflake pricing.
By Type allows you to separate the resource consumption into two categories:
 Compute (including virtual warehouses and serverless resources)
 Cloud services
This filter provides a clear distinction between the two types of compute resources.
Filter By Type
By Service separates resource consumption into warehouse consumption and consumption by
each serverless feature. Cloud services compute is included in the warehouse consumption.

Filter By Service
By Resource, on the other hand, separates resource consumption based on the Snowflake object
that consumed the credits.

Filter By Resource
Using SQL Queries for Viewing Snowflake Compute Costs
Snowflake offers two powerful
schemas: ORGANIZATION_USAGE and ACCOUNT_USAGE. These schemas contain rich
information about Snowflake credit usage and Snowflake pricing. They provide granular usage
data, ready for analytics, and can be used to create custom reports and dashboards.
The majority of the views in these schemas present the cost of compute resources in terms of
credits consumed. But suppose you're interested in exploring the cost in currency rather than
credits. In that case, you can write queries using the USAGE_IN_CURRENCY_DAILY View,
which converts credits consumed into cost in currency using the daily price of credit.
The following views provide valuable insights into the cost of compute resources and help you
make data-driven decisions on resource usage and cost control. For a summary of which views
cover which compute resources and schemas, refer to the following table:
View                                   Compute Resource                          Schema
AUTOMATIC_CLUSTERING_HISTORY           Serverless                                ORGANIZATION_USAGE, ACCOUNT_USAGE
DATABASE_REPLICATION_USAGE_HISTORY     Serverless                                ACCOUNT_USAGE
MATERIALIZED_VIEW_REFRESH_HISTORY      Serverless                                ORGANIZATION_USAGE, ACCOUNT_USAGE
METERING_DAILY_HISTORY                 Warehouses, Serverless, Cloud Services    ORGANIZATION_USAGE, ACCOUNT_USAGE
METERING_HISTORY                       Warehouses, Serverless, Cloud Services    ACCOUNT_USAGE
PIPE_USAGE_HISTORY                     Serverless                                ORGANIZATION_USAGE, ACCOUNT_USAGE
QUERY_ACCELERATION_HISTORY             Serverless                                ACCOUNT_USAGE
REPLICATION_USAGE_HISTORY              Serverless                                ORGANIZATION_USAGE, ACCOUNT_USAGE
REPLICATION_GROUP_USAGE_HISTORY        Serverless                                ORGANIZATION_USAGE, ACCOUNT_USAGE
SEARCH_OPTIMIZATION_HISTORY            Serverless                                ACCOUNT_USAGE
SERVERLESS_TASK_HISTORY                Serverless                                ACCOUNT_USAGE
USAGE_IN_CURRENCY_DAILY                Warehouses, Serverless, Cloud Services    ORGANIZATION_USAGE
WAREHOUSE_METERING_HISTORY             Warehouses, Cloud Services                ORGANIZATION_USAGE, ACCOUNT_USAGE
Sample SQL Queries for Monitoring Snowflake Compute Costs
The following queries provide a deep dive into the data contained within
the ACCOUNT_USAGE views, offering a closer look at Snowflake compute costs.
Calculating Snowflake Compute Costs: Virtual Warehouses
Average hour-by-hour Snowflake credits over the past 30 days
The following query gives a full breakdown of credit consumption on a per-hour basis, giving
you a clear picture of consumption patterns and trends over the past 30 days. This detailed
information can help you identify peak usage times and fluctuations in consumption, allowing
for a deeper understanding of resource utilization.
select start_time, warehouse_name, credits_used_compute
from snowflake.account_usage.warehouse_metering_history
where start_time >= dateadd(day, -30, current_timestamp())
and warehouse_id > 0 order by 1 desc, 2;

-- by hour
select date_part('HOUR', start_time) as start_hour, warehouse_name,
avg(credits_used_compute) as credits_used_compute_avg
from snowflake.account_usage.warehouse_metering_history
where start_time >= dateadd(day, -30, current_timestamp())
and warehouse_id > 0
group by 1, 2
order by 1, 2;

Average credit consumption per hour, past 30 days


Credit consumption by warehouse over a specific period of time
This query shows the total credit consumption for each warehouse over a specific time period.
This helps identify warehouses that are consuming more credits than others and specific
warehouses that are consuming more credits than anticipated.
-- Credits used (all time = past year)
select warehouse_name
      ,sum(credits_used_compute) as credits_used_compute_sum
from snowflake.account_usage.warehouse_metering_history
group by 1
order by 2 desc;

-- Credits used (past 30 days)
select warehouse_name
      ,sum(credits_used_compute) as credits_used_compute_sum
from snowflake.account_usage.warehouse_metering_history
where start_time >= dateadd(day, -30, current_timestamp())
group by 1
order by 2 desc;

Credit consumption by warehouse, specific time period‌‌


Snowflake Warehouse usage over m-day average
This query calculates each warehouse's daily credit usage along with a rolling average over the
preceding 60 days, and the percentage by which each day's usage deviates from that rolling
average. It filters out days with zero credit usage and orders the results by the percentage
deviation, which helps you identify anomalous spikes in warehouse credit consumption.
with cte_date_wh as (
    select to_date(start_time) as start_date
          ,warehouse_name
          ,sum(credits_used) as credits_used_date_wh
    from snowflake.account_usage.warehouse_metering_history
    group by start_date
            ,warehouse_name
)
select start_date
      ,warehouse_name
      ,credits_used_date_wh
      ,avg(credits_used_date_wh) over (partition by warehouse_name order by start_date rows 60 preceding) as credits_used_m_day_avg
      ,100.0 * ((credits_used_date_wh / credits_used_m_day_avg) - 1) as pct_over_to_m_day_average
from cte_date_wh
where credits_used_date_wh > 0
order by pct_over_to_m_day_average desc;
Daily credit usage per warehouse vs. m-day average
Calculating Snowflake Compute Costs: Automatic Clustering
Automatic Clustering cost history (by day, by object)
This query summarizes the credit consumption for automatic clustering over the past month by
date, database, schema, and table name, and orders the result in descending order by total credits
used.
select to_date(start_time) as date
      ,database_name
      ,schema_name
      ,table_name
      ,sum(credits_used) as credits_used
from "SNOWFLAKE"."ACCOUNT_USAGE"."AUTOMATIC_CLUSTERING_HISTORY"
where start_time >= dateadd(month, -1, current_timestamp())
group by 1,2,3,4
order by 5 desc;
Automatic Clustering cost history
Calculating Snowflake Compute Costs: Search Optimization
Search Optimization cost history (by day, by object)
This query summarizes the total credits consumed by the Search Optimization Service over the
past month by date, database name, schema name, and table name, and orders the results in
descending order of credits used.
select
to_date(start_time) as date
,database_name
,schema_name
,table_name
,sum(credits_used) as credits_used

from "SNOWFLAKE"."ACCOUNT_USAGE"."SEARCH_OPTIMIZATION_HISTORY"
where start_time >= dateadd(month,-1,current_timestamp())
group by 1,2,3,4
order by 5 desc;
Search Optimization History
This query first totals the credits used by the Search Optimization Service per day over the past
year, then averages those daily totals by week and orders the results by the date each week
starts.
with credits_by_day as (
    select to_date(start_time) as date
          ,sum(credits_used) as credits_used
    from "SNOWFLAKE"."ACCOUNT_USAGE"."SEARCH_OPTIMIZATION_HISTORY"
    where start_time >= dateadd(year, -1, current_timestamp())
    group by 1
    order by 2 desc
)

select date_trunc('week',date)
,avg(credits_used) as avg_daily_credits
from credits_by_day
group by 1
order by 1;
Calculating Snowflake Compute Costs: Materialized Views
Materialized Views cost history (by day, by object)
This query generates a complete report of Materialized Views and the number of credits utilized
by the service in the past 30 days, sorted in descending order of credits used.
select to_date(start_time) as date
      ,database_name
      ,schema_name
      ,table_name
      ,sum(credits_used) as credits_used
from "SNOWFLAKE"."ACCOUNT_USAGE"."MATERIALIZED_VIEW_REFRESH_HISTORY"
where start_time >= dateadd(month, -1, current_timestamp())
group by 1,2,3,4
order by 5 desc;
Materialized Views History & m-day average
This query illustrates the average daily credits consumed by Materialized Views grouped by
week over the last year. It helps highlight deviations or changes in the daily average, providing
an opportunity to examine significant spikes or shifts in Snowflake credit consumption.
with credits_by_day as (
select to_date(start_time) as date
,sum(credits_used) as credits_used

from "SNOWFLAKE"."ACCOUNT_USAGE"."MATERIALIZED_VIEW_REFRESH_HISTORY"
where start_time >= dateadd(year,-1,current_timestamp())
group by 1
order by 2 desc
)

select date_trunc('week',date)
,avg(credits_used) as avg_daily_credits
from credits_by_day
group by 1
order by 1;
Calculating Snowflake Compute Costs: Snowpipe
Snowpipe cost history (by day, by object)
This query provides a full list of pipes and the volume of credits consumed via the service over
the past month, sorted in descending order of credits used.
select
to_date(start_time) as date
,pipe_name
,sum(credits_used) as credits_used

from "SNOWFLAKE"."ACCOUNT_USAGE"."PIPE_USAGE_HISTORY"
where start_time >= dateadd(month,-1,current_timestamp())
group by 1,2
order by 3 desc;
Calculating Snowflake Compute Costs: Replication
Replication cost history
This query provides a full list of replicated databases and the volume of credits consumed via
the replication service over the past month (30-day period).
select
to_date(start_time) as date
,database_name
,sum(credits_used) as credits_used
from "SNOWFLAKE"."ACCOUNT_USAGE"."REPLICATION_USAGE_HISTORY"
where start_time >= dateadd(month,-1,current_timestamp())
group by 1,2
order by 3 desc;
Replication History & m-day average
This query provides the average daily credits consumed by Replication, grouped by week, over
the past year.
with credits_by_day as (
    select to_date(start_time) as date
          ,sum(credits_used) as credits_used
    from "SNOWFLAKE"."ACCOUNT_USAGE"."REPLICATION_USAGE_HISTORY"
    where start_time >= dateadd(year, -1, current_timestamp())
    group by 1
    order by 2 desc
)

select date_trunc('week',date)
,avg(credits_used) as avg_daily_credits
from credits_by_day
group by 1
order by 1;
Calculating Snowflake Compute Costs: Cloud Services
Warehouses with high cloud services usage
This query identifies warehouses whose cloud services usage is high relative to their warehouse
compute, i.e., warehouses that may not consume enough warehouse credits to offset the cloud
services portion under the 10% daily adjustment.
select
warehouse_name
,sum(credits_used) as credits_used
,sum(credits_used_cloud_services) as credits_used_cloud_services
,sum(credits_used_cloud_services)/sum(credits_used) as
percent_cloud_services
from "SNOWFLAKE"."ACCOUNT_USAGE"."WAREHOUSE_METERING_HISTORY"
where to_date(start_time) >= dateadd(month,-1,current_timestamp())
and credits_used_cloud_services > 0
group by 1
order by 4 desc;

7 Best Practices for Optimizing Snowflake Compute Costs


Snowflake provides a usage-based pricing model for its users, where costs are determined based
on how much Snowflake compute and storage is used. To effectively manage Snowflake costs,
data professionals must ensure their Snowflake credits are being used in the most efficient way
possible.
To help with Snowflake cost management, here are a few best practices you can use for
optimizing Snowflake compute costs:
1) Snowflake Warehouse Size Optimization
Selecting the right warehouse size for Snowflake can be a tricky decision. Snowflake compute
costs increase as you move up in warehouse size, so it is important to consider carefully what
size is best for your workload. Generally, larger warehouses tend to run queries faster, but getting
the optimal size requires trial and error. An effective strategy may be starting with a smaller
warehouse and gradually working your way up based on performance outcomes. Check out this
article for a deeper, more in-depth understanding of optimizing your Snowflake warehouse size.
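For example, a warehouse can be resized on the fly with a single ALTER statement; this is only a sketch, and the warehouse name here is a placeholder:
-- Start small and scale up only if query runtimes demand it
alter warehouse COMPUTE_WH set warehouse_size = 'SMALL';

-- Scale back down once the heavy workload has finished
alter warehouse COMPUTE_WH set warehouse_size = 'XSMALL';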
2) Optimizing Cluster and Scaling Policy for Snowflake Warehouses
Managing compute clusters and scaling policy are two essential aspects when aiming to reduce
Snowflake compute costs. In most cases, it is advisable to begin with a minimal number of
clusters and grow the cluster count gradually depending on workload activity. This
functionality is only available for users on Enterprise or higher editions. For most workloads, the
scaling policy is set to Standard, but if queuing is not an issue for your workload, you can set it
to Economy, which conserves credits by trying to keep the running clusters fully loaded (see the
sketch after the note below).
Note: The Economy scaling policy is also only available in the Enterprise edition or
above.
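As a rough sketch (assuming an Enterprise edition account; the warehouse name and cluster counts are placeholders), a multi-cluster warehouse with the Economy scaling policy can be configured like this:
-- Add clusters only when queuing builds up, and favor keeping clusters fully loaded
alter warehouse COMPUTE_WH set
  min_cluster_count = 1
  max_cluster_count = 3
  scaling_policy = 'ECONOMY';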
3) Workload Optimization in Snowflake
Group similar workloads together and start with a smaller warehouse. According to Snowflake's
guidance, aim for queries that complete within roughly 5 to 10 minutes. Monitor your queue to
ensure jobs don't spend excessive time waiting, as queuing increases execution time and credit
consumption for the overall workload.
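To spot jobs that queue or run longer than that guideline, a sketch like the following can be run against the ACCOUNT_USAGE.QUERY_HISTORY view (the thresholds are illustrative; elapsed and queued times are reported in milliseconds):
-- Queries from the past 7 days that ran longer than 10 minutes or queued longer than 1 minute
select query_id
      ,warehouse_name
      ,total_elapsed_time / 1000 as elapsed_seconds
      ,queued_overload_time / 1000 as queued_seconds
from snowflake.account_usage.query_history
where start_time >= dateadd(day, -7, current_timestamp())
  and (total_elapsed_time > 600000 or queued_overload_time > 60000)
order by total_elapsed_time desc;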
4) Snowflake Monitoring with Snowflake Resource Monitors
Once you have grouped your workloads and chosen the appropriate warehouse size, set up
resource monitors to detect any problems. Snowflake Resource monitors can help you detect any
problems with the workloads. They can either stop/suspend all activity or send alert notifications
when Snowflake consumption reaches specific thresholds. For more info, check out this
article which goes into depth about the Snowflake Resource monitor and how to use it to
optimize Snowflake compute costs.
5) Snowflake Observability with Chaos Genius
Snowflake compute costs can quickly add up. The Snowflake resource monitor may not provide
the detailed insights you need to manage your costs and keep them from getting out of control.
That's where Chaos Genius comes in. This DataOps observability platform provides actionable
insights into your Snowflake spending and usage, allowing you to break down costs into
meaningful insights, optimize your Snowflake usage, and reduce Snowflake compute costs.
Chaos Genius
dashboard
Using Chaos Genius, you can reduce spending on Snowflake compute costs by up to 30%. It
provides an in-depth view and insight into workflows, enabling you to pinpoint areas that
generate high costs and adjust usage accordingly. It gives you unmatched speed and flexibility in
monitoring your expenses, so that you can accurately identify which features or products are
driving up your compute costs and take corrective measures without sacrificing performance.
Chaos Genius provides real-time notifications that inform you about cost anomalies as soon as
they occur, allowing you to respond swiftly and effectively. Take advantage of this powerful
service today – book a demo with us now and see how it can transform your business.
6) Snowflake Query Optimization
Keep in mind that querying data in Snowflake consumes credits. The trick for Snowflake
compute costs optimization is to tweak the query code and settings for efficient operation
without affecting job performance. Read more on Snowflake Query Optimization in our article
here.
7) Snowflake Materialized Views to minimize resource usage
Another way to reduce the Snowflake compute costs is to make use of materialized views. This
feature, available only in Snowflake's Enterprise Edition, involves pre-computing large data sets
for later use and querying them more efficiently. A materialized view is especially beneficial
when the underlying table, or the subset of rows it uses, does not change often and the query
would otherwise consume a lot of resources, such as processing time, credits, and storage space,
each time the results are generated.
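As a simple sketch (the table and column names here are purely illustrative), a materialized view that pre-aggregates a large sales table might look like this:
-- Pre-compute daily revenue so frequent reports stop re-scanning the base table
create materialized view DAILY_REVENUE_MV as
select order_date
      ,sum(order_amount) as total_revenue
from SALES
group by order_date;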
Conclusion
Snowflake's innovative pricing model has disrupted the cloud-based data platform industry, but
navigating the Snowflake costs can be daunting.
Optimizing Snowflake compute costs is the key to unlocking the full potential of this
revolutionary data warehousing solution. You can save big without sacrificing performance by
following these best practices and leveraging features like Snowflake Resource Monitors and
observability tools like Chaos Genius. Make sure to take full advantage of Snowflake's flexible
pricing structure and use this guide as your go-to resource for reducing Snowflake compute
costs!

FAQs
What is a virtual warehouse in Snowflake?
Virtual warehouse in Snowflake is a cluster of computing resources used to execute queries, load
data and perform DML(Data Manipulation Language) operations. It consists of CPU cores,
memory, and SSD and can be quickly deployed and resized as needed.
What is the pricing model for Snowflake virtual warehouses?
Snowflake virtual warehouses are priced based on the number of warehouses used, their runtime,
and their actual size. The pricing varies for different sizes of virtual warehouses.
How are serverless compute costs calculated in Snowflake?
Serverless compute costs in Snowflake are calculated based on the total usage of the specific
serverless feature. Each serverless feature consumes a certain number of Snowflake credits per
compute hour.
What are some serverless features in Snowflake?
Serverless features in Snowflake include automatic clustering, external tables, materialized
views, query acceleration service, search optimization service, Snowpipe, and serverless tasks.
What is Snowflake Cloud Services Compute?
Snowflake Cloud Services Compute refers to a combination of services that manage various
tasks within Snowflake, such as user authentication, query optimization, and metadata
management.
How does Snowflake pricing work for Cloud Services?
Cloud service usage is paid for with Snowflake credits, and you only pay if your daily cloud
service usage exceeds 10% of your total virtual warehouse usage.
Can I view Snowflake compute costs in currency instead of credits?
Yes, Snowflake provides the USAGE_IN_CURRENCY_DAILY view to convert credits
consumed into cost in currency using the daily price of credits.
Is Snowflake compute cost expensive?
Snowflake compute cost can be expensive, but it depends on your specific workloads. If you
only need a small amount of compute power, Snowflake can be a cost-effective option.
However, if you need a lot of compute power, Snowflake can be more expensive than traditional
on-premises data warehouses.
Snowflake Zero Copy Clone 101—An Essential Guide (2024)

Snowflake zero copy clone is an incredibly useful and advanced feature that allows users to
clone a database, schema, or table quickly and easily without any additional Snowflake storage
costs. What's more, depending on the size of the source object, a Snowflake zero copy clone
typically completes in just minutes, without the complex manual configuration often required in
conventional databases. This article covers all you need to know about Snowflake zero copy
clone.
Let's dive in!
What is Snowflake zero copy clone?
Snowflake zero copy clone, often referred to as "cloning", is a feature in Snowflake that
effectively creates an exact copy of a database, table, or schema without consuming extra
storage space, taking up additional time, or duplicating any physical data. Instead, a logical
reference to the source object is created, allowing for independent modifications to both the
original and cloned objects. Snowflake zero copy cloning is fast and offers you maximum
flexibility with no additional Snowflake storage costs associated with it.
Use-cases of Snowflake zero copy clone
Snowflake zero copy clone provides users with substantial flexibility and freedom, with use
cases like:
 To quickly perform backups of Tables, Schemas, and Databases.
 To create a free sandbox to enable parallel use cases.
 To enable quick object rollback capability.
 To create various environments (e.g., Development, Testing, Staging, etc.).
 To test possible modifications or developments without creating a new environment.
Snowflake zero copy clone provides businesses with smarter, faster, and more flexible data
management capabilities.
How does Snowflake zero copy clone work?
The Snowflake zero copy clone feature allows users to clone a database object without making a
copy of the data. This is possible because of the Snowflake micro-partitions feature, which
divides all table data into small chunks that each contain between 50 and 500 MB of
uncompressed data. However, the actual size of the data stored in Snowflake is smaller because
the data is always stored compressed. When cloning a database object, Snowflake simply creates
new metadata entries pointing to the micro-partitions of the original source object, rather than
copying it for storage. This process does not involve any user intervention and does not
duplicate the data itself—that's why it's called "zero copy clone".
To gain a better understanding, let's deep dive even further.
To illustrate this, consider a database table, EMPLOYEE table, and its cloned
snapshot, EMPLOYEE_CLONE, in a Snowflake database. The metadata layer in Snowflake
connects the metadata of EMPLOYEE to the micro-partitions in the storage layer where the
actual data resides. When the EMPLOYEE_CLONE table is created, it generates a new metadata
set pointing to the same micro-partitions storing the data for EMPLOYEE. Essentially, the
clone EMPLOYEE_CLONE table is a new metadata layer for EMPLOYEE rather than a physical
copy of the data. The beauty of this approach is that it enables us to create clones of tables
quickly without duplicating the actual data, saving time and storage space. Although the clone
initially shares the same set of micro-partitions as the original table, changes made to the data
in either table after cloning do not affect the other: modified data is written to new
micro-partitions that belong only to the table that changed.

Snowflake zero copy clone illustration
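For reference, creating such a clone is a one-line statement. A minimal sketch using the EMPLOYEE table from this example (the Time Travel variant assumes the data is still within the table's retention period):
-- Clone copies only metadata; no micro-partitions are duplicated
create table EMPLOYEE_CLONE clone EMPLOYEE;

-- Optionally, clone the table as it existed one hour ago using Time Travel
create table EMPLOYEE_CLONE_1H clone EMPLOYEE at (offset => -3600);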


In Snowflake, micro-partitions cannot be changed/altered once they are created. Suppose any
modifications to the data within a micro-partition need to be made. In that case, a new micro-
partition must be created with the updated changes (the existing partition is maintained to
provide fail-safe measures and time travel capabilities). For instance, when data in
the EMPLOYEE_CLONE table is modified, Snowflake replicates and assigns the modified
micro-partition (M-P-3) to the staging environment, updating the clone table with the newly
generated micro-partition (M-P-4) and references it exclusively for
the EMPLOYEE_CLONE table, thereby incurring additional Snowflake storage costs only for
the modified data rather than the entire clone.
Cloned Data illustration
What are the benefits of Snowflake zero copy clone?
Snowflake zero copy clone feature offers a variety of beneficial characteristics. Let's look at
some of the key benefits:
 Effective data cloning: Snowflake zero copy clone allows you to create fully-usable
copies of data without physically copying the data, significantly reducing the time
required to clone large objects.
 Saves storage space and costs: It doesn't require the physical duplication of data or
underlying storage, and it doesn't consume additional storage space, which can save on
Snowflake costs.
 Hassle-free cloning: It provides a straightforward process for creating copies of your
tables, schemas, and databases using the keyword "CLONE" without needing
administrative privileges.
 Single-source data management: It creates a new set of metadata pointing to the same
micro-partitions that store the original data. Each clone update generates new micro-
partitions that relate solely to the clone.
 Data Security: It maintains the same level of security as the original data. This ensures
that sensitive data is protected even when it's cloned.
What are the limitations of Snowflake zero copy clone?
Snowflake zero copy clone feature offers many benefits. Still, there are certain limitations to
keep in mind:
 Resource requirements and performance impact: Cloning operations require adequate
computing resources, so excessive cloning can lead to performance degradation.
 Longer clone time for large micro-partitions: Cloning a table with a large number of
micro-partitions may take longer, although it is still faster than a traditional copy.
 Unsupported Object Types for Cloning: Cloning does not support all object types.
Which are the objects supported in Snowflake zero copy clone?
Snowflake zero copy clone feature supports cloning of the following database objects:
 Databases
 Schemas
 Tables
 Views
 Materialized views
 Sequences
Note: When a database object is cloned, the clone is not a physical copy of the source object;
rather, the clone references the source object's data, and modifications to the clone do
not affect the source object. The clone gets its own metadata, including its own
access controls, so the user must ensure that the appropriate permissions
are granted on the clone.
How does access control work with cloned objects in Snowflake?
When using Snowflake's zero copy clone feature, it's important to keep in mind that cloned
objects do not automatically inherit the privileges granted on the source object. This means that an
account admin (ACCOUNTADMIN) or the owner of the cloned object must explicitly grant any
required privileges to the newly created clone.
If the source object is a database or schema, the granted privileges of any child objects in the
source will be replicated to the clone. But, in order to create a clone, the current role must have
the necessary privileges on the source object. For example, tables require the SELECT privilege,
pipes, streams, and tasks require the OWNERSHIP privilege, and other object types
require the USAGE privilege.
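As a quick illustration (the role name here is hypothetical), the owner of a newly created clone would grant access explicitly, since grants on the source table do not carry over:
-- Grants on the source table do not carry over to the clone
grant select on table EMPLOYEE_CLONE to role ANALYST_ROLE;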
What are the account-level objects not supported in Snowflake zero copy clone?
Snowflake zero copy clone doesn't support particular objects that cannot be cloned. These
include account-level objects, which exist at the account level. Some examples of account-level
objects are:
 Account-level roles
 Users
 Grants
 Virtual Warehouses
 Resource monitors
 Storage integrations
Conclusion
Snowflake zero copy clone feature provides an innovative and cost-efficient way for users to
clone tables without incurring additional Snowflake storage costs. This process streamlines the
workflow, allowing databases, tables, and schemas to be cloned without creating separate
environments.
This article provided an in-depth overview of Snowflake zero copy clone, from how it works to
its potential use cases, and demonstrated how to set up and utilize the feature.
If you're interested in delving into a comprehensive guide that walks you through the process of
creating a Snowflake zero copy clone table from the ground up, be sure to take a look at this
article!

FAQs
Why is it called zero copy clone?
The term "Zero Copy Clone" is used because Snowflake's cloning process doesn't involve
physical data copying. It creates a reference to the source data, eliminating the need for
duplication and resulting in zero additional storage costs.
How does Snowflake zero copy clone work?
Snowflake zero copy clone works by creating new metadata entries that point to the micro-
partitions of the original source object instead of making a physical copy of the data.
What are the advantages of zero copy cloning Snowflake?
 Effective data cloning without physical duplication, saving time.
 Storage space and cost savings as it doesn't consume additional storage.
 Hassle-free cloning process using the "CLONE" keyword.
 Single-source data management with new metadata for each clone.
 Maintaining data security and access controls.
What are the limitations of Snowflake zero copy clone?
 Resource requirements and potential performance impact.
 Longer clone time for tables with a large number of micro-partitions.
 Not all object types are supported for cloning.
Which objects are supported in Snowflake Zero Copy Cloning?
 Databases
 Schemas
 Tables
 Views
 Materialized views
 Sequences
Can Snowflake stages be cloned?
Yes, individual external named stages in Snowflake can be cloned. External stages refer to
buckets or containers in external cloud storage. Cloning an external stage does not affect the
referenced cloud storage. However, internal (Snowflake) named stages cannot be cloned.
Can you clone internal named stages?
No, Internal named stages cannot be cloned.
How does Zero Copy Cloning save time and money?
Zero Copy Cloning eliminates the need for creating multiple development environments in
separate accounts, reducing costs and time spent on creating large copies of production tables.
Snowflake Resource Monitors 101: A Comprehensive Guide (2024)

Snowflake Resource Monitors is a powerful feature that helps with Snowflake monitoring and
helps you avoid unexpected credit usage. It's the only official monitoring tool from Snowflake that
lets you monitor your credit consumption and control your warehouses, making it essential for
Snowflake users. In this article, we'll explain how Resource Monitors work, how to set them up,
how to customize notifications and define actions, and much more!
What is Snowflake Resource Monitors?
Snowflake Resource Monitors is a feature in Snowflake that helps you manage and control
Snowflake costs. It allows you to monitor and set limits on your compute resources, so you can
avoid overspending and potentially going over your budget. By setting up Resource Monitors,
you can define actions to be taken when usage thresholds are exceeded, such as sending alerts or
suspending warehouses. This gives you more control over your Snowflake data warehouse and
helps you stay within your budget.
What Features Do Snowflake Resource Monitors Offer?
Snowflake Resource Monitors is a great feature for anyone using Snowflake. It offers a wealth
of features that can help you better monitor and manage your Snowflake workloads and
resources, giving you more control over your Snowflake environment.
Some of the important features of the Snowflake Resource Monitors are:
 Credit usage monitoring: It allows you to monitor your credit usage in real-time, so you
can stay updated on your account's resource consumption.
 Multiple levels of control: Snowflake Resource Monitors can be set at the account level
or warehouse level, giving users high flexibility and control over their resources.
 Credit quota management: It allows you to set credit quotas for specific warehouses and
even for the entire account to help you stay within your budget and avoid unexpected
Snowflake costs.
 Custom notification + alerting: It can send notifications and alerts to users when credit
usage reaches specific thresholds.
 Custom Actions: It allows you to configure custom actions that can be triggered when
credit usage is exceeded (such as by suspending a warehouse or sending an alert).
How to setup Snowflake Resource Monitors using Snowflake Web UI?
Setting up Snowflake Resource Monitors is a straightforward process. To make one in
Snowflake, follow the steps outlined below:
Step 1: Log in to Snowflake Web UI
To setup Snowflake Resource Monitors, you need to log in to Snowflake Web UI.

Snowflake login page


Step 2: Navigate to the Snowflake Resource Monitors page
Once you have logged in to Snowflake Web UI, navigate to the Resource Monitors page. You
can find this page under the "Admin" tab.
Admin section and Usage dropdown
Step 3: Click on the "+Resource Monitors" button
Under "Admin", select "Resource Monitors" and then click "+Resource Monitors".

Resource Monitor dashboard


You will now see a "New Resource Monitor" window. In this window, you can set the Resource
Monitor's name, credit quota, monitor type, schedule, and the actions to be taken when credit
usage thresholds are exceeded.
New Resource
Monitor configuration
Step 4: Configure the Credit Quota property
Inside the "New Resource Monitor" popup window, configure the Credit Quota property. This is
the number of credits that are allowed to be consumed in a given interval of time during which
the Resource Monitor takes action.

Credit Quota
Step 5: Configure the Monitor Type property
Next, let's configure the Monitor Type property. The Snowflake Resource Monitor can monitor
credit usage at the Account and Warehouse levels. If you have selected the Account level, the
Resource Monitors will monitor the credit usage for your entire account (i.e., all warehouses in
the account), and if you select the Monitor Type as Warehouse level, you need to individually
select the warehouses to monitor.
Account level

Account level
monitor
Warehouse level

Warehouse level
monitor
Step 6: Configure the Schedule property
Now that you have configured the Monitor Type property, let's configure the Schedule property.
By default, the scheduling of the Resource Monitor is set to begin monitoring immediately and
reset back the credit usage to 0 at the beginning of each calendar month. However, you can
customize the scheduling of the Resource Monitor to your needs.
 Time Zone: You have two options to set the schedule's time zone: Local and UTC.
 Starts: You can start the Resource Monitor immediately or later. If you choose Later,
you should enter the date and time for the Resource Monitor to start.
 Resets: You can choose the frequency interval at which the credit usage resets. The
supported values are Daily, Weekly, Monthly, Yearly, and Never.
 Ends: You can run the Resource Monitor continuously by selecting the "never" reset
option, or you can set it to stop at a certain date and time.
Customize
Resource Monitor schedule
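For reference, the same schedule settings can be expressed in SQL when creating a Resource Monitor; this is just a sketch, with a placeholder name, quota, and start timestamp (FREQUENCY and START_TIMESTAMP must be set together):
create resource monitor "ResourceMonitor_WEEKLY" with
  credit_quota = 100
  frequency = weekly
  start_timestamp = '2024-01-01 00:00 PST'
  triggers
    on 90 percent do notify;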
Step 7: Configure the Actions property
Now that you have set up the basic properties of your Resource Monitor, it's time to define
the Actions property that will be taken when the credit usage threshold gets exceeded. As you
can see in the screenshot below, In the Actions property, there are three types of actions that
Resource Monitors provide by default:
Resource Monitors
actions configuration
Resource Monitors actions configuration
 Suspend Immediately: This action suspends all the assigned warehouses immediately,
which cancels any query statements being executed by the warehouses at that time.
 Suspend and notify: This action suspends all the assigned warehouses after all statements
being executed by the warehouse(s) have been completed and then sends an alert
notification to the users.
 Notify: This action sends an alert notification to all users with notifications enabled.
Note: A Resource Monitor can have one Suspend and Notify action, one Suspend
Immediately action, and up to five Notify actions. A Resource Monitor must have at
least one action defined; if no actions have been defined, you won't be able to create
your Resource Monitor.
Step 8: Create the Resource Monitor
As you can see in the screenshot below, we have successfully configured the Resource Monitor.
So, once you've configured it, click the "Create Resource Monitor" button to create it. As you
can see, we created a Snowflake Resource Monitor with the name "ResourceMonitor_DEMO," a
Credit Quota of "200" at the warehouse level, one warehouse configured with the default
schedule, Suspend Immediately and Notify action set to trigger at 80% of credit usage, Suspend
and Notify action set to trigger at 75% of credit usage, and Finally Notify action set to trigger
at 50% & 60% of credit usage.
Resource Monitors
configuration

Resource Monitor dashboard


How to create a Snowflake Resource Monitor using a SQL query?
Creating Snowflake Resource Monitors with a SQL query is straightforward. Snowflake provides
the following DDL commands for creating and managing Resource Monitors: CREATE
RESOURCE MONITOR, ALTER RESOURCE MONITOR, SHOW RESOURCE MONITORS,
and DROP RESOURCE MONITOR. However, for this article's purpose, we will focus on
creating only.
CREATE Resource Monitors "ResourceMonitor_DEMO2" WITH CREDIT_QUOTA = 300
TRIGGERS
ON 80 PERCENT DO SUSPEND
ON 75 PERCENT DO SUSPEND_IMMEDIATE
ON 60 PERCENT DO NOTIFY
ON 50 PERCENT DO NOTIFY;
This query creates a Resource Monitor named "ResourceMonitor_DEMO2" with a credit quota
of 300 and sets triggers for suspend and suspend immediate actions at 80 and 75 percent usage,
respectively. The monitor also includes a trigger for a notify action at 60 and 50 percent of
usage respectively.

Create Resource Monitor


Once the query has been executed, navigate back to the Resource Monitors page. There, you
should see the newly created Resource Monitor, which will be in an active state.

Resource Monitor dashboard


How to Assign Warehouses to Resource Monitors?
Once you have successfully created a Snowflake Resource Monitor using SQL, warehouses can
be easily assigned to it with the following SQL query:
ALTER WAREHOUSE "COMPUTE_WH" SET RESOURCE_MONITOR = "ResourceMonitor_DEMO2";

Assign warehouse to Resource Monitors


You can even configure and set your Resource Monitors to Account level; for that, execute the
following SQL query.
ALTER ACCOUNT SET RESOURCE_MONITOR = ResourceMonitor_DEMO2;
How to enable Notification Alerts for Account admin?
Account administrators can receive notifications via the web interface or via email. Notifications
are not enabled by default. It must be enabled manually, so follow the steps outlined below.
 Use the ACCOUNTADMIN role. If you are not already in this role, you can switch to it
by clicking on the drop-down menu next to your name in the upper-right corner,
selecting "Switch role", and then selecting "ACCOUNTADMIN" from the list.

User profile role


 In the same drop-down menu, select Profile > Toggle Notifications.
Toggle
notifications

How to enable Notification Alerts for Non-Admin Accounts?


Enabling email notifications for non-admin users can only be done through SQL statements (it's
not possible via the web interface). Use the SQL command shown below to enable notifications
for non-admin users:
CREATE Resource Monitors "ResourceMonitor_ALERT" WITH CREDIT_QUOTA = 300
NOTIFY_USERS = ('USER_NAME')
TRIGGERS
ON 80 PERCENT DO SUSPEND
ON 75 PERCENT DO SUSPEND_IMMEDIATE
ON 60 PERCENT DO NOTIFY
ON 50 PERCENT DO NOTIFY;
To view your Resource Monitors, including the users configured to receive notification alerts,
use the following SQL command:
SHOW RESOURCE MONITORS;

Enable notifications alert for Non-admin


Conclusion
Snowflake Resource Monitors is an invaluable feature for any Snowflake user who wants to gain
better control over their costs. The setup process is straightforward; this article explains how to
create a Resource Monitor, what it does, and how to set up notifications and trigger actions.
With the Snowflake Resource Monitors in place, you don't have to worry about unexpected
Snowflake costs. Get started on your Snowflake Monitoring journey today!

FAQs
How does Snowflake Resource Monitors help control Snowflake costs?
Snowflake Resource Monitors helps you manage and control Snowflake costs. It allows you to
monitor and set limits on your compute resources, so you can avoid overspending and
potentially going over your budget.
How to setup resource monitor in Snowflake using Snowflake Web UI?
To set up Snowflake Resource Monitors using Snowflake Web UI, log in, navigate to the
Resource Monitors page under the Admin tab, and click on the "+Resource Monitors" button.
What are the steps to configure a Resource Monitor's credit quota and schedule?
To configure a Resource Monitor's credit quota and schedule, specify the desired values in the
New Resource Monitor window, including credit quota, monitor type (account or warehouse
level), and schedule properties.
Can I define custom actions for Snowflake Resource Monitors?
Yes, Snowflake Resource Monitors allow you to define custom actions to be taken when credit
usage thresholds are exceeded, such as suspending warehouses or sending alerts.
Can I create Snowflake Resource Monitors using SQL queries?
Yes, Snowflake Resource Monitors can be created using SQL queries, using this simple query:
CREATE Resource Monitors "ResourceMonitor_DEMO" WITH CREDIT_QUOTA = 300
TRIGGERS
ON 80 PERCENT DO SUSPEND
ON 75 PERCENT DO SUSPEND_IMMEDIATE
ON 60 PERCENT DO NOTIFY
ON 50 PERCENT DO NOTIFY;
Can Resource Monitors be set at both the account and warehouse levels?
Yes, Resource Monitors can be set at both the account and warehouse levels, providing
flexibility and control over resource management.
Which schema has the resource monitor view in Snowflake?
The RESOURCE_MONITORS view displays the resource monitors that have been created in the
reader accounts managed by the account. This view is available only in the
READER_ACCOUNT_USAGE schema.

Snowflake COPY INTO load behavior and performance


How to Use TRUNCATE TABLE to Delete Old Data Before Loading New Data?
Sometimes, you may want to delete the old data in the target table before loading the new data
from the stage, which helps you to avoid duplicate or outdated data, and keep the table clean and
up-to-date.
To do this, you can use the TRUNCATE TABLE command, which deletes all the data in the
table without affecting the table schema or metadata. You can use the TRUNCATE TABLE
command before the COPY INTO command to delete the old data before loading the new data.
For example:
-- Delete the old data
TRUNCATE TABLE <table>;

-- Load the new data from the stage to the table


COPY INTO my_table
FROM @my_stage;
Conclusion
And that’s a wrap! Snowflake COPY INTO command is a powerful and flexible command that
allows you to load data from files in stages to tables, or unload data from tables or queries to
files in stages, using a simple and straightforward syntax. This command can help you to access,
transform, and analyze data from various sources and formats, and optimize the data
loading/unloading process.
In this article, we have covered:
 What is COPY INTO in Snowflake?
 Prerequisite Requirements for Using Snowflake COPY INTO
 How Does Snowflake COPY INTO Command Work?
 How to Load Data From External Stages to Snowflake Tables?
 How to Unload Data From Snowflake Tables to External Stages?
 How to Load Data From Internal Stages to Snowflake Tables?
 How Do You COPY INTO Validation in Snowflake?
 How to Unload Data From Snowflake Tables to Internal Stages?
 How Do I Copy a File From Snowflake to Local?
 Controlling Load Behavior and Performance Using Various Options
…and so much more!
Snowflake COPY INTO command is like a bridge that connects the data in the stages and the
tables. The bridge can be built and customized in different ways, depending on the source and
the destination of the data, and the options and the parameters that you choose.
FAQs
What is Snowflake COPY INTO command used for?
Snowflake COPY INTO command is used to load data from stages into tables or unload data
from tables into stages. It enables efficient data ingestion in Snowflake.
What are the different types of stages supported by Snowflake COPY INTO?
Snowflake COPY INTO supports both external stages like S3, Azure, GCS and internal stages
like user stages, table stages, and named stages.
What file formats does Snowflake COPY INTO command support?
Snowflake COPY INTO supports loading data from CSV, JSON, Avro, Parquet, ORC and other
file formats.
Does Snowflake COPY INTO command handle compression formats?
Yes, COPY INTO can automatically detect and load compressed files, such as GZIP- and
Snappy-compressed files.
How does Snowflake COPY INTO command validate data during loads?
Snowflake COPY INTO provides a VALIDATION_MODE parameter to validate data before
load. It can return errors, warnings, or just sample rows during validation.
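As a minimal sketch (my_table and my_stage are placeholder names), a dry run that reports any
errors in the staged files without loading data might look like this:
-- Validate the staged files without actually loading them
COPY INTO my_table
FROM @my_stage
VALIDATION_MODE = RETURN_ERRORS;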
Can Snowflake COPY INTO command transform data during loads?
Yes, Snowflake COPY INTO provides options to run transformations on the data using SQL
before loading it into the target table.
How does Snowflake COPY INTO command optimize load performance?
Snowflake COPY INTO uses multi-threaded parallel loads to maximize throughput. Load
performance can be tuned by right-sizing the staged files (roughly 100-250 MB compressed is a
common recommendation) and by using copy options such as SIZE_LIMIT for loads and
MAX_FILE_SIZE for unloads.
Does Snowflake COPY INTO command purge files after loading?
Yes, the PURGE copy option can be used to delete files from the stage after loading them.
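A minimal sketch (placeholder names) of a load that cleans up the stage after the files are loaded:
-- Remove successfully loaded files from the stage
COPY INTO my_table
FROM @my_stage
PURGE = TRUE;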
How does Snowflake COPY INTO command handle errors during loads?
The ON_ERROR copy option controls error handling: the load can be aborted, continued past bad
records, or the offending files can be skipped when errors occur.
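For example, a load that skips any file containing bad records could look like this (a sketch with
placeholder names):
-- Skip a file as soon as an error is found in it, and keep loading the remaining files
COPY INTO my_table
FROM @my_stage
ON_ERROR = 'SKIP_FILE';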
How can I download a file from a Snowflake stage?
You can make use of the GET command to download files from a Snowflake internal stage to
local.
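For instance, running the following from SnowSQL (not the web UI) downloads a staged file to a
local directory; the stage, file, and path names here are placeholders:
-- Download a file from an internal stage to a local folder
GET @my_stage/data.csv.gz file:///tmp/downloads/;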
Can Snowflake COPY INTO be used for incremental data loads?
Yes. COPY INTO keeps load metadata for each file (for 64 days), so re-running the command
against a stage loads only new or changed files and skips those already loaded, which makes it
well suited for incremental loads.
Does Snowflake COPY INTO command work across different Snowflake accounts/regions?
No, Snowflake COPY INTO works only within the same Snowflake account and region. It
cannot load data across accounts or regions.
Can Snowflake COPY INTO command load semi-structured data like XML?
Yes, Snowflake COPY INTO supports loading XML data using the XML file format option.
How does authentication work with external stages in COPY INTO?
Snowflake COPY INTO uses the credentials configured in the storage integration to authenticate
and access external stages.
Does Snowflake COPY INTO command load data in parallel?
Yes, Snowflake COPY INTO uses multi-threading for parallel loads to maximize throughput.
Can Snowflake COPY INTO command load whole folders from a stage at once?
COPY INTO operates on files rather than folders, but you can load every file under a stage path by
specifying the path prefix, filter files with the PATTERN option, or name them explicitly with the
FILES parameter.
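A sketch of loading everything under a path prefix that matches a pattern (all names are
placeholders):
-- Load every CSV file under the 2024/ prefix of the stage
COPY INTO my_table
FROM @my_stage/2024/
PATTERN = '.*[.]csv';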
Does Snowflake COPY INTO command validate schemas during loads?
No, Snowflake COPY INTO does not validate the schema. Use constraints on the table instead
to enforce schema validation.
Can Snowflake COPY INTO command calls be made idempotent?
Yes. By default, COPY INTO skips files it has already loaded (tracked in load metadata), so
re-running the same command does not create duplicates. Note that setting FORCE = TRUE reloads
files and breaks this idempotency.
What Is Snowflake REPLACE() Function?
Snowflake REPLACE() is a string and binary function that allows you to remove all occurrences
of a specified substring, and optionally replace them with another string. Snowflake REPLACE()
operates on character data types, such as VARCHAR, CHAR, TEXT, and STRING.
Snowflake REPLACE() can be used for various purposes, such as:
 Removing unwanted prefixes/suffixes from a string
 Correcting spelling errors or typos in a string
 Changing the format or style of a string
 Updating outdated or incorrect information in a string
 Performing dynamic string manipulation based on expressions or variables
Snowflake REPLACE() is simple to use, but it also offers loads of flexibility and customization
options. In the next section, we will explain how the Snowflake REPLACE() works and what its
arguments are.
How Does a Snowflake REPLACE() Function Work?
As we have already covered, Snowflake REPLACE() function removes all occurrences of a
specified substring and optionally replaces them with another string. Put simply, Snowflake
REPLACE() performs a case-sensitive search to find occurrences of the substring specified in
the second argument within the original string. Whenever it encounters that substring, it will
replace it with the new replacement string provided as the third argument.
Syntax for REPLACE() is straightforward:
REPLACE( <subject> , <pattern> [ , <replacement> ] )
Snowflake REPLACE() takes 3 arguments:
 <subject>: String value to be searched and modified, which can be a literal value, a
column name, or an expression that returns a character data type.
 <pattern>: Substring to be searched for and replaced in the subject, which can be a
literal value, a column name, or an expression that returns a character data type.
 <replacement>: Substring to replace the matched pattern in the subject. This can also be
a literal value, a column name, or an expression that returns a character data type.
Note:
 <pattern> is matched using literal strings, not regular expressions.
 <replacement> can also be an empty string ('') to remove the pattern from the subject.
Snowflake REPLACE() returns a new string value with all occurrences of the pattern replaced
by the replacement. The function preserves the data type of the subject. If any of the arguments
is NULL, the function returns NULL. We will discuss in detail in a later section how Snowflake
REPLACE() handles NULL values.
Here is one simple example of using the Snowflake REPLACE() function:
SELECT REPLACE('Slash your Snowflake spend with ---', '---', 'Chaos
Genius') as result;
Using Snowflake REPLACE to replace the string value
As you can see, Snowflake REPLACE function replaces all occurrences of the string “---” in the
subject “Slash your Snowflake spend with —” with “Chaos Genius”, and returns the new
string “Slash your Snowflake spend with Chaos Genius”.
Snowflake REPLACE() also helps you to remove all the occurrences of a specified substring by
using an empty string ('') as the replacement argument.
For example, if you want to remove “with Chaos Genius” from a string, you can use the
following query:
SELECT REPLACE('Slash your Snowflake spend with Chaos Genius', 'with Chaos
Genius', '') AS result;

Removing all the occurrences of a specified substring by using an empty string as a replacement
argument - Snowflake REPLACE
As you can see, Snowflake REPLACE replaces the substring “with Chaos Genius” in the
subject “Slash your Snowflake spend with Chaos Genius” with an empty string, and returns
the new string “Slash your Snowflake spend”.
Now, in the next section, we will compare the Snowflake REPLACE() function with another
similar function, REGEXP_REPLACE()
What Is the Difference Between Snowflake REPLACE and REGEXP_REPLACE?
Snowflake REPLACE() is not the only way to replace substrings in a string value. There is
another similar function called REGEXP_REPLACE that also allows you to perform
replacement operations but with some additional features and flexibility.
Syntax for REGEXP_REPLACE() is straightforward:
REGEXP_REPLACE( <subject> , <pattern> [ , <replacement> , <position> ,
<occurrence> , <parameters> ] )
Snowflake REGEXP_REPLACE function takes 6 (4 of ‘em are optional) arguments:
 <subject>: String value to be searched and modified.
 <pattern>: Regular expression to be searched for and replaced in the subject.
 <replacement>: Substring to replace the matched pattern in the subject. If an empty
string is specified, the function removes all matched patterns and returns the resulting
string.
 <position>: Number of characters from the beginning of the string where the function
starts searching for matches.
 <occurrence>: Number of occurrences of the pattern to be replaced in the subject. If 0 is
specified, all occurrences are replaced.
 <parameters>: A string of characters that specify the behavior of the function. This is
an optional argument that supports one or more of the following characters:
 c: Enables case-sensitive matching.
 i: Enables case-insensitive matching.
 m: Enables multi-line mode. By default, multi-line mode is disabled.
 e: Extracts sub-matches.
 s: Allows the POSIX wildcard character . to match \n (newline). By default, wildcard
matching of newlines is disabled.
For more details, see regular expression parameters
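As a quick illustration of the difference, the following sketch masks every digit in a string with #,
something plain REPLACE cannot do because it only matches literal substrings:
SELECT REGEXP_REPLACE('Call 415-555-0100 now', '[0-9]', '#') AS masked;
-- Returns: Call ###-###-#### now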
Now, let's dive into the main difference between Snowflake REPLACE and
REGEXP_REPLACE. The primary difference is that Snowflake REGEXP_REPLACE uses
regular expressions to match the pattern, while Snowflake REPLACE uses literal strings. Here is
a table that quickly summarizes the differences between the Snowflake REPLACE and
REGEXP_REPLACE functions.

 Matching: Snowflake REPLACE() uses simple literal substring matching, while Snowflake
REGEXP_REPLACE() uses regular expressions (regex) for more complex pattern matching.
 Syntax: REPLACE( <subject> , <pattern> [ , <replacement> ] ) vs.
REGEXP_REPLACE( <subject> , <pattern> [ , <replacement> , <position> , <occurrence> ,
<parameters> ] ).
 Arguments: REPLACE() takes 3 arguments (the replacement is optional), while
REGEXP_REPLACE() takes 6 arguments, most of which are optional.
 Behavior: REPLACE() replaces (or removes) all occurrences of a specified literal substring,
while REGEXP_REPLACE() returns the subject with the specified pattern either removed or
replaced by the replacement string.
TLDR; Snowflake REPLACE() is best for simple substring substitutions on string data.
REGEXP_REPLACE() is more advanced and configurable, but also more complex.
How does Snowflake REPLACE Function Handle Null Values?
Snowflake REPLACE function treats NULL values in a special way. If any of the arguments of
the function is NULL, the function returns NULL as the result, meaning that the function does
not perform any replacement operation on the subject if the pattern or the replacement is NULL.
Similarly, the function does not return any value if the subject itself is NULL.
For example:
SELECT REPLACE(NULL, 'some_string', 'new_string');
SELECT REPLACE('string', NULL, 'new_string');

SELECT REPLACE('string', 'some_string', NULL);

Each of these queries returns NULL, because at least one of the arguments is NULL.
Practical Examples of Snowflake REPLACE Function
Now, in this section, we will see some practical examples of using the Snowflake REPLACE
function to perform various data manipulation and transformation tasks. We will use a sample
table called gadgets and insert some dummy data into it:
CREATE TABLE gadgets (
    id NUMBER,
    name VARCHAR,
    description VARCHAR,
    price NUMBER
);

INSERT INTO gadgets VALUES
    (1, 'IPhone 15', 'Latest smartphone from Apple', 1799),
    (2, 'Samsung Galaxy', 'Flagship smartphone from Samsung', 799),
    (3, 'MacBook Pro', 'Powerful laptop from Apple with a 13-inch display and M1 chip', 1299),
    (4, 'Dell XPS', 'Sleek laptop from Dell with Intel Core i7 processor', 1299),
    (5, 'AirPods', 'Wireless earbuds from Apple with ANC', 300),
    (6, 'Kindle', 'E-reader from Amazon with an awesome display', 429),
    (7, 'Xiaomi Tablet', 'Tablet from Xiaomi with 10-inch display', 749);
After the insert, the gadgets table contains the seven rows shown above, one per gadget.
Example 1—Basic Usage of Snowflake REPLACE Function
Now, we will use the Snowflake REPLACE function to replace substrings in a string. Suppose
we want to change the name of the product "IPhone 15" to "Apple IPhone 19" in the name
column.
We can use the following query:
SELECT id, REPLACE(name, 'IPhone 15', 'Apple IPhone 19') AS name,
description, price
FROM gadgets
WHERE id = 1;
As you can see, the Snowflake REPLACE function replaces the substring "IPhone 15" in the
name column with the substring "Apple IPhone 19", and returns the new string "Apple
IPhone 19".
Example 2—Removing Prefixes/Suffixes
Next, we will use the Snowflake REPLACE function to remove prefixes or suffixes from a
string. Suppose we want to remove the gadget names from the name column and only keep the
model names.
We can use the following query:
SELECT id, REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(name, 'Apple ', ''),
'Samsung ', ''), 'Xiaomi ', ''), 'Dell ', ''), 'Amazon ', '') AS name,
description, price
FROM gadgets
WHERE id IN (1, 2, 3, 4, 5, 6, 7);
As you can see, Snowflake REPLACE function removes the substrings “Apple”, ”Samsung”,
“Xiaomi”, “Dell” and “Amazon” from the gadgets name column by replacing them with empty
strings (''), and returns the new strings with only the gadgets model names.
Example 3—Handling null values with Snowflake REPLACE()
As we saw earlier, NULL values result in NULL outputs:
SELECT id, REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(NULL, 'Apple ', ''),
'Samsung ', ''), 'Xiaomi ', ''), 'Dell ', ''), 'Amazon ', '') AS name,
description, price
FROM gadgets
WHERE id IN (1, 2, 3, 4, 5, 6, 7);


As you can see, if any of the arguments of the function is NULL, the function returns NULL as
the result.
Example 4—Standardizing Data Formats
Now, in this particular example, we will use the Snowflake REPLACE function to standardize
the data formats in a string. Suppose we create another table called gadgets_order and insert
some dummy data into it:
CREATE TABLE gadgets_order (
    id NUMBER,
    customer_id NUMBER,
    product_id NUMBER,
    order_date VARCHAR,
    order_amount VARCHAR
);

INSERT INTO gadgets_order VALUES
    (1, 11, 2, '2024-01-01', '1799.00'),
    (2, 22, 1, '01/02/2024', '799.00 USD'),
    (3, 33, 3, '01-03-2024', '1299.00 US Dollars'),
    (4, 44, 4, '2024/01/04', '1299.00 Dollars'),
    (5, 55, 5, '2024/01/04', '300.00 Dollars'),
    (6, 66, 7, '2024/01/04', '429.00 Dollars'),
    (7, 77, 6, '2024/01/04', '749.00 Dollars');
After the insert, the gadgets_order table contains seven rows whose order_date and
order_amount values use inconsistent formats.
Suppose we want to standardize the format of the order_date column to YYYY-MM-DD, and
the format of the order_amount column to $XXX.XX. We can use the following query:
SELECT id, customer_id, product_id,
REPLACE(
REPLACE(order_date, '/', '-'),
'.',
'-'
) AS order_date,
'$' || REPLACE(
REPLACE(
REPLACE(
REPLACE(order_amount, '$', ''),
'USD',
''
),
'Dollars',
''
),
'US',
''
) AS order_amount
FROM
gadgets_order;
As you can see, Snowflake REPLACE function replaces the different separators and currency
symbols in the order_date and order_amount columns with the standard ones and returns the
new strings.
What Is the Difference Between TRANSLATE() and Snowflake REPLACE()?
Both Snowflake REPLACE() and TRANSLATE() are String functions that can substitute
characters within strings, but there are some notable differences:

 Snowflake REPLACE() works with substrings of any length, while TRANSLATE() performs
single-character substitutions.
 Syntax: REPLACE( <subject> , <pattern> [ , <replacement> ] ) vs.
TRANSLATE( <subject> , <sourceAlphabet> , <targetAlphabet> ).
 Snowflake REPLACE() allows more control over the find-and-replace logic, while
TRANSLATE() has a simpler syntax.
 Snowflake REPLACE() does a single find-and-replace operation, while TRANSLATE() makes
multiple translations in one operation.
TLDR; TRANSLATE() is best for fast bulk character substitutions, while Snowflake
REPLACE() enables more advanced string manipulation with greater flexibility.
When to Use Snowflake REPLACE Function?
Finally, in this section, we will discuss when to use the Snowflake REPLACE function and
explore its benefits and limitations. Here are common use cases where Snowflake REPLACE()
can be useful:
1) Removing or Replacing Substrings
Use Snowflake REPLACE to eliminate unwanted characters, words, or phrases from a string by
substituting them with an empty string (''). Also, it helps in replacing existing substrings and
altering names, formats, or even styles within a string.
2) Data Cleansing and Transformation
You can use Snowflake REPLACE to correct spelling errors, typos, or inconsistencies within
your data. Standardize data formats like dates, numbers, or currencies, thereby enhancing data
quality and accuracy.
3) Dynamic String Manipulation
Use Snowflake REPLACE for string operations driven by expressions or variables. For instance,
combine it with other string functions to concatenate, split, or extract substrings and generate
new strings based on conditions or logic.
4) Simple String Alterations
You can implement Snowflake REPLACE for straightforward changes to a string, such as
adding or removing prefixes or suffixes, altering the case, or reversing the order. This function
streamlines string modifications easily.
5) Control Over Replacement Logic
Snowflake REPLACE gives you predictable find-and-replace behavior: it matches a literal
substring, replaces every occurrence, and leaves the rest of the string untouched. When you need
finer control, such as case-insensitive matching or replacing only certain occurrences, you can
combine REPLACE with functions like LOWER()/UPPER() or switch to REGEXP_REPLACE().
Either way, it offers more flexibility and customization than simpler functions like TRANSLATE().
Conclusion
And that’s a wrap! Snowflake REPLACE() is a powerful function that allows you to replace all
occurrences of a specified substring in a string value with another substring. Snowflake
REPLACE() can help you perform various data manipulation and transformation tasks, such as
cleansing, standardization, extraction—and analysis. As we saw in the examples above,
Snowflake REPLACE() can be used for simple tasks like correcting typos as well as more
complex tasks like dynamic string manipulation. But, you should always be aware of how
Snowflake REPLACE() handles the null values and the case sensitivity of the arguments.
In this article, we covered:
 What Is Snowflake REPLACE() Function?
 How Does a Snowflake REPLACE() Function Work?
 What Is the Difference Between Snowflake REPLACE and REGEXP_REPLACE?
 How does Snowflake REPLACE Function Handle Null Values?
 Practical Examples of Snowflake REPLACE Function
 What Is the Difference Between TRANSLATE() and Snowflake REPLACE()?
 When to Use Snowflake REPLACE Function?
…and so much more!
By now, you should be able to use Snowflake REPLACE to manipulate and transform your
string data effectively. It's simple to use, yet customizable for diverse needs, helping you to
effectively cleanse, transform—and standardize your string/text data.
FAQs
What is the Snowflake REPLACE() function?
Snowflake REPLACE() is a string function that finds and replaces a specified substring with a
new substring in a string value. It replaces all occurrences of the specified substring.
Does Snowflake REPLACE() replace all occurrences or just the first one?
Snowflake REPLACE() will replace all occurrences of the specified substring, not just the first
match.
Is Snowflake REPLACE() case-sensitive?
Yes, Snowflake REPLACE() performs case-sensitive matches by default.
How does Snowflake REPLACE() handle NULL values?
If any Snowflake REPLACE() argument is NULL, it returns NULL without performing any
replace.
Can Snowflake REPLACE() insert new characters?
Yes, the replacement string can contain new characters not originally present.
Can Snowflake REPLACE() be used to remove substrings?
Yes, you can remove substrings by replacing ‘em with an empty string.
When would Snowflake REPLACE() be useful for data cleansing?
Snowflake REPLACE can correct invalid data entries, standardize formats, and fix
typos/inconsistencies to improve data quality.
Can I use column values or expressions as arguments in Snowflake REPLACE()?
Yes, you can use column names or expressions that evaluate to a string instead of just literal
strings.
Is there a limit on the string length supported by Snowflake REPLACE()?
The maximum string length is 16MB (16777216 characters) which is the Snowflake
STRING/VARCHAR limit.
How is Snowflake TRANSLATE() different from Snowflake REPLACE()?
TRANSLATE does single-character substitutions while Snowflake REPLACE works on entire
strings.
Can I use Snowflake REPLACE() to concatenate or split strings?
Yes, Snowflake REPLACE can be used alongside other string functions like CONCAT or SPLIT
for such operations.
When should I avoid using Snowflake REPLACE()?
Avoid Snowflake REPLACE if you only need fast, bulk single-character substitutions (use
TRANSLATE instead), or if you need regex-based pattern matching (use REGEXP_REPLACE).
Is Snowflake REPLACE() case-sensitive by default?
Yes, Snowflake REPLACE performs case-sensitive literal substring matching by default.
Can I make Snowflake REPLACE() case-insensitive?
Yes. You can approximate case-insensitive behavior by converting both the subject and the
pattern to the same case before applying REPLACE, or by using REGEXP_REPLACE with the 'i'
parameter.
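A minimal sketch of both approaches (the sample strings are arbitrary):
-- Normalize the case first, then do a literal replace
-- (note: LOWER() also lowercases the rest of the string)
SELECT REPLACE(LOWER('Snowflake SNOWFLAKE snowflake'), 'snowflake', 'SF') AS via_lower;

-- Or use REGEXP_REPLACE with the case-insensitive 'i' parameter
SELECT REGEXP_REPLACE('Snowflake SNOWFLAKE snowflake', 'snowflake', 'SF', 1, 0, 'i') AS via_regex;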
Does Snowflake REPLACE() support regex or wildcards?
No, Snowflake REPLACE does not support regex or wildcards, only literal substring matching.
Can Snowflake REPLACE() insert strings that don't exist in the original?
Yes, the replacement string can contain new characters not originally present.
What data types does Snowflake REPLACE() support?
Snowflake REPLACE works on STRING, VARCHAR, CHAR, TEXT, and similar
string/character data types.
What Is the Difference Between Snowflake IFF and Snowflake CASE?
IFF is another Snowflake conditional expression function similar to Snowflake CASE. The key
differences are:

 Snowflake CASE handles complex, multi-branch conditional logic, while IFF is a single-level
if-then-else expression.
 CASE evaluates its WHEN conditions in sequence and returns the result of the first one that
matches; IFF evaluates a single condition.
 CASE can check multiple values and conditions; IFF cannot.
 Syntax: CASE WHEN <condition1> THEN <result1> [ WHEN <condition2> THEN <result2> ]
[ ... ] [ ELSE <result3> ] END vs. IFF( <condition> , <expr1> , <expr2> ).
 CASE is more extensible and customizable; IFF requires less code for simple scenarios.
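To make the contrast concrete, here is a sketch using the gadgets table from the REPLACE examples
earlier (the price thresholds are arbitrary): a single condition fits IFF, while multi-branch logic
calls for CASE:
SELECT name,
       IFF(price > 1000, 'premium', 'standard') AS tier_iff,
       CASE
           WHEN price > 1000 THEN 'premium'
           WHEN price > 500  THEN 'mid-range'
           ELSE 'budget'
       END AS tier_case
FROM gadgets;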

When to Use Snowflake CASE Statement?


Snowflake CASE statement is a powerful and versatile function that can help you perform
various tasks in data analysis and decision-making. You can use Snowflake CASE statement in
many situations, such as:
1) Avoiding Nested IF Statements
When dealing with multiple conditions, nested IF statements can become confusing. Snowflake
CASE statement offers a way to simplify your logic, making your code more readable and
maintainable: it replaces nested IF statements with a single expression capable of handling
various scenarios and outcomes.
2) Classifying Data Into Categories
For large datasets requiring grouping or labeling based on specific criteria, Snowflake CASE
statement helps in creating categories and assigning values. It enables comparisons between
values, returning different results based on the comparison. On top of that, it also facilitates
creating ranges or intervals and delivering distinct outcomes based on these ranges.
3) Replacing Missing Values and Handling Nulls
If you have a data set that contains missing or unknown values, Snowflake CASE statement can
help you deal with them and avoid errors or exceptions. You can use Snowflake CASE statement
to check for null values and replace them with default or custom values. You can also use
Snowflake CASE statement to handle errors and exceptions and return appropriate values or
messages.
4) Complex Data Transformations and Multi-Step Calculations
For datasets necessitating intricate transformations or multi-step calculations involving various
conditions, Snowflake CASE statement can help you perform them efficiently and accurately.
You can use Snowflake CASE statement to evaluate expressions and return values based on
whether they are true or false. It also facilitates calculations, applying different formulas based
on input. You can also use nested Snowflake CASE statements to create multi-level conditions
and logic.
Conclusion
And that’s a wrap! Snowflake CASE statement is a flexible and powerful function that can help
you write more concise and elegant code that can handle various scenarios and outcomes. It
manages conditional logic in a highly versatile way without convoluted nested IFs. Thus, a
thorough understanding of how to apply CASE effectively can significantly enhance code
efficiency and readability.
In this article, we have covered:
 What Is Snowflake CASE Statement?
 How Does a Snowflake CASE Statement Work?
 What Is the Alternative to CASE in Snowflake?
 How Snowflake CASE Statement Handles Null Values?
 Practical Examples of Snowflake CASE Statement
 What Is the Difference Between IFF and Snowflake CASE?
 When to Use Snowflake CASE Statement?
—and so much more!!
Using Snowflake CASE is like having an advanced multi-level decision maker that can route
data and logic based on any criteria you define. Think of CASE statements as powerful SQL
switches that go far beyond basic IF-THEN logic.

FAQs
What is a Snowflake CASE statement?
Snowflake CASE is a conditional expression in Snowflake that allows you to perform different
computations based on certain conditions.
How does CASE work in Snowflake?
Snowflake CASE evaluates conditions in sequence and returns the result of the first matching
condition. An optional ELSE clause specifies a default result.
Can I use Snowflake CASE in a SELECT query in Snowflake?
Yes, Snowflake CASE can be used in SELECT, INSERT, UPDATE and other statements
anywhere an expression is valid.
How do I check for NULL values in a CASE statement?
You can use IS NULL or IS NOT NULL to explicitly check for nulls, as nulls do not match
other nulls in CASE.
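For example, a sketch that fills in missing descriptions, using the gadgets table from the earlier
examples (any table with a nullable text column would do):
SELECT name,
       CASE
           WHEN description IS NULL THEN 'No description available'
           ELSE description
       END AS description
FROM gadgets;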
Can I nest Snowflake CASE statements in Snowflake?
Yes, you can nest Snowflake CASE statements to create multi-level conditional logic.
What is an alternative to CASE in Snowflake?
DECODE is an alternative that compares an expression to a list of values to return a match.
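A minimal sketch of DECODE (the table and status codes here are made up for illustration):
-- Maps 'A' to 'Active', 'I' to 'Inactive', and anything else to 'Unknown'
SELECT DECODE(status, 'A', 'Active', 'I', 'Inactive', 'Unknown') AS status_label
FROM my_table;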
When should I use Snowflake CASE instead of DECODE in Snowflake?
Snowflake CASE provides more flexibility for complex logic with multiple conditions.
DECODE is simpler for basic value matching.
What is the difference between CASE and IFF in Snowflake?
CASE can evaluate multiple conditions in sequence and return a different result for each branch,
while IFF handles a single condition as a one-line if-then-else. CASE can check multiple values;
IFF cannot.
How can I handle errors with Snowflake CASE?
Always check for errors and exceptions explicitly in Snowflake CASE conditions, and return
appropriate messages.
Can Snowflake CASE improve performance?
Snowflake CASE can improve readability over nested IFs. However, a join against a lookup table
may be faster than a very complex CASE expression.
What are some common uses of CASE in Snowflake?
Data transformations, conditional aggregation, pivoting, error handling, and business logic.
What data types can I use with Snowflake CASE?
Snowflake CASE results and return values can be any Snowflake data type.
Does the Snowflake CASE statement support standard SQL?
Yes, CASE conditional expressions are part of the ANSI SQL standard.
Are there any limitations with Snowflake CASE?
No major limitations. Just watch for performance with overly complex logic.
Can I use subqueries in a Snowflake CASE?
Yes, Snowflake supports using subqueries in CASE conditions and return values.
Advanced Snowflake Interview Questions and Answers:
Going beyond the basics, these advanced Snowflake interview questions probe deeper into
your understanding and expertise:
Performance Optimization:
 How would you optimize a slow-running query in Snowflake? Analyze
the query profile, identify bottlenecks (e.g., expensive joins, subqueries, full
scans, disk spilling), and consider: Clustering: cluster tables on frequently
filtered or joined columns. Materialized views: pre-compute frequently used
query results for faster access. Warehouse sizing: right-size (or temporarily
scale up) the warehouse for the workload. Search Optimization Service:
speed up highly selective point lookups, since Snowflake does not use
traditional indexes or manual partitioning.
 Explain Snowflake's various caching mechanisms and how they impact
performance. Discuss result cache, local disk cache, and remote disk cache.
Explain how they store frequently accessed data for faster retrieval and
reduce query execution time.
 How would you monitor and troubleshoot performance issues in
Snowflake? Utilize Snowflake monitoring tools like Query History,
Warehouse History, and Workloads to identify slow queries, resource
usage, and bottlenecks.
Security & Compliance:
 Describe your approach to implementing data security and access
control in Snowflake. Discuss strategies for user authentication,
authorization (roles, privileges), data encryption, and activity monitoring.
Mention relevant security features like Dynamic Data Masking and Row
Level Security.
 How would you ensure compliance with data privacy regulations like
GDPR or CCPA in Snowflake? Explain data anonymization and
pseudonymization techniques. Discuss how Snowflake's features like Data
Masking and Secure Data Sharing can help comply with regulations.
 How would you handle a potential data breach in Snowflake? Describe
your incident response plan, including data loss assessment, notification
procedures, and remediation steps. Mention Snowflake's security features
like Security Alerts and User Activity Logs that can aid in such situations.
Advanced Features & Use Cases:
 Explain the concept of Snowflake Streams and how you would use
them for real-time data processing. Discuss how Streams ingests and
processes data in real-time, enabling applications like anomaly detection
and fraud prevention.
 Describe Snowflake's capabilities for machine learning and data
science. Explain Snowflake's integration with external ML tools and its
native features like User Defined Functions (UDFs) for building and
deploying ML models.
 How would you design a Snowflake data pipeline for a complex data
processing scenario? Discuss your approach, including data sources,
ETL/ELT tools, data transformation steps, and Snowflake integration.
Showcase your understanding of data pipelines and their role in modern
data architectures.
Remember:
 Go beyond basic definitions and provide detailed explanations with
practical examples.
 Showcase your problem-solving skills by explaining your thought
process and approach.
 Demonstrate your understanding of the latest Snowflake features and
functionalities.
 Be confident and articulate when expressing your knowledge and
expertise.
By preparing for these advanced questions and showcasing your in-depth understanding of
Snowflake, you can impress interviewers and solidify your position as a sought-after
Snowflake professional.
1. What are the different types of Stages?
Stages are commonly referred to as the storage platform used to store the files.
In Snowflake, there are two types of stages:
1. Internal stage — Resides in the Snowflake storage
2. External stage — Resides in any of the cloud object storage (AWS S3, Azure
Blob, GCP bucket )
Data can be retrieved from the stage or transferred to the stage using the COPY
INTO command.
For BULK loading you can use COPY INTO and for continuous data loading
you need to use SNOWPIPE, an autonomous service provided by Snowflake.
To load data from the local file system into snowflake you can use
the PUT command.
2. What is Unique about Snowflake Cloud Data Warehouse?
Snowflake has introduced many features that are not available in most other data warehouses
currently on the market.
1. Totally cloud-agnostic (SaaS)-
Snowflake runs on any of the three major cloud providers (AWS, Azure, GCP) for its underlying
infrastructure. It provides true SaaS functionality: users do not need to download or install any
software to use Snowflake, nor worry about managing any hardware.
2. Decoupled storage and compute-
Decoupled means that storage and compute scale and are billed separately, while working
together through the services layer Snowflake builds on top of the cloud provider. This helps
decrease usage costs, since users pay only for what they actually use.
3. Zero copy cloning-
This feature takes a snapshot of an object (such as a table) at the current instant, for example to
create a backup. The clone does not consume any additional physical storage unless changes are
made to it, because it references the same micro-partitions as the source table. Once changes are
made to the cloned object, the changed data is written to new micro-partitions.
4. Secure data sharing-
This feature provides secure sharing of data with other Snowflake accounts, or with users outside
of Snowflake. By secure, it means you grant only authorized users access to particular objects,
keeping them hidden from the rest of your Snowflake users. Shared objects are always read-only.
You can create a reader account to share data with a consumer who is not using Snowflake.
5. Supports semi-structured data-
Snowflake supports semi-structured file formats such as JSON, AVRO, ORC, PARQUET, and
XML. The VARIANT data type is used to load semi-structured data into Snowflake; once loaded,
it can be separated into multiple columns in a table. A VARIANT value has a limit of 16 MB per
individual row. The FLATTEN function is used to split nested attributes into separate rows and
columns, as shown in the sketch below.
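As a minimal sketch (the table raw_events and its VARIANT column payload are hypothetical),
FLATTEN can expand a nested JSON array into rows:
-- Expand each element of payload:items into its own row
SELECT f.value:name::STRING AS item_name,
       f.value:qty::NUMBER  AS quantity
FROM raw_events,
     LATERAL FLATTEN(input => payload:items) f;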
6. Scalability-
As Snowflake is built on cloud infrastructure, it uses cloud services for storage and compute. A
warehouse is a cluster of compute resources used to carry out the computation required to execute
queries. This gives users the ability to scale resources up when large amounts of data need to be
loaded or processed faster, and scale back down when the work is finished, without any
interruption to service.
7. Time-travel and Fail-safe-
Time Travel lets you retrieve Snowflake objects (tables, schemas, databases) that have been
dropped, and read data that has been updated or deleted, within a permissible retention period.

CDP lifecycle
Using Time Travel, you can perform the following actions within a defined period of time:
1. Query data in the past that has since been updated or deleted.
2. Create clones of entire tables, schemas, and databases at or before specific points in the
past.
3. Restore tables, schemas, and databases that have been dropped.
3. What are the different ways to access the Snowflake Cloud Data warehouse?
Snowflake provides a web UI (Snowsight) to access Snowflake, as well as SnowSQL, a CLI for
executing SQL queries and performing all DDL and DML operations, including data loading and
unloading. It also provides native connectors and drivers for Python, Spark, Go, Node.js, JDBC,
and ODBC.
4. What are the data security features in Snowflake?
Snowflake provides below security features:
1. Data encryption
2. Object-level access
3. RBAC
4. Secure data sharing
5. Masking policies for sensitive data
5. What are the benefits of Snowflake Compression?
Snowflake compresses stored data by default (staged files are gzip-compressed unless specified
otherwise), which reduces the storage space occupied and improves data loading and unloading
performance. It also automatically detects compressed file formats such as gzip, bzip2, deflate,
and raw_deflate.
6. What is Snowflake Caching? What are the different types of caching in
Snowflake?
It comprises three types of caching :
1. Result cache- This holds the results of every query executed in the past 24 hours.
2. Local disk cache- This is used to cache data used by SQL queries. Whenever data is
needed for a given query it’s retrieved from the Remote Disk storage, and cached in SSD
and memory.
3. Remote disk (storage layer)- This holds the long-term data storage and is responsible for data
resilience, which in the case of Amazon Web Services means 99.999999999% durability, even in
the event of an entire data center failure.
7. Is there a cost associated with Time Travel in Snowflake?
Yes. Time Travel incurs additional storage costs, because Snowflake retains the historical data
needed to restore or query previous states of your objects.
Using Time Travel you can:
1. Query data in the past that has since been updated or deleted.
2. Create clones of entire tables, schemas, and databases at or before specific points in the
past.
3. Restore tables, schemas, and databases that have been dropped.
Once the Time Travel period is over, data is moved to the Fail-safe zone.
For Snowflake Standard Edition, the Time Travel retention period is 1 day.
For Snowflake Enterprise Edition (and above):
for transient and temporary databases, schemas, and tables, the retention period is 1 day;
for permanent databases, schemas, and tables, the retention period can range from 1 to 90 days.
8. What is fail-safe in Snowflake
When the Time Travel period elapses, removed data for permanent tables moves into a 7-day
Fail-safe zone. Once data is in Fail-safe, you need to contact Snowflake Support to restore it,
which may take anywhere from 24 hours to several days. Storage charges continue to accrue for
data held in Fail-safe, billed in 24-hour increments.
9. What is the difference between Time-Travel vs Fail-Safe in Snowflake
Time Travel has a retention period ranging from 0 to 90 days for permanent databases, schemas,
and tables, whereas Fail-safe is fixed at 7 days.
Once a table or schema is dropped, it remains recoverable through Time Travel for the retention
period configured on that object (0-90 days).
Once the Time Travel period elapses, the object moves into the Fail-safe zone.
Snowflake provides us with 3 methods of time travel –
a. Using Timestamp — We can do time travel to any point of time before or after the
specified timestamp.
b. Using Offset — We can do time travel to any previous point in time.
c. Using Query ID — We can do time travel to any point of time before or after the specified
Query ID.
Now let's drop the table, try reading it again (which fails because the table no longer exists),
recover the dropped table using UNDROP, and then read the data from the table again, as shown
in the sketch below.
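A minimal sketch of this drop/undrop flow, assuming a table named orders exists:
DROP TABLE orders;          -- drop the table
SELECT * FROM orders;       -- fails: the table no longer exists
UNDROP TABLE orders;        -- restore the table from Time Travel
SELECT * FROM orders;       -- the data is readable again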
10. How does zero-copy cloning work and what are its advantages
Zero-copy cloning creates a clone of a Snowflake object. You can clone objects such as databases,
schemas, tables, streams, stages, file formats, sequences, and tasks.
When you create a clone, Snowflake simply points the clone's metadata at the source object's
existing micro-partitions; no data is copied until you make changes to the cloned object.
1. The main advantage is that it creates a copy of the object almost instantly.
2. It does not consume any extra storage if no updates happen on the cloned object.
3. It is a fast way to take a backup of any object.
Syntax:
create table orders_clone clone orders;
11. What are Data Shares in Snowflake?
Data sharing is a feature provided by Snowflake to share data across Snowflake accounts, and
with people outside of Snowflake. You control exactly which objects and datasets are shared. For
consumers who do not have a Snowflake account, you can create a reader account with read-only
access to the data.
Below are the objects that can be shared:
Tables
External tables
Secure views
Secure materialized views
Secure UDFs
There are two types of users:
1. Data provider: The provider creates a share of a database in their account and grants access
to specific objects in the database. The provider can also share data from multiple databases,
as long as these databases belong to the same account.
2. Data consumer: On the consumer side, a read-only database is created from the share.
Access to this database is configurable using the same, standard role-based access control that
Snowflake provides for all objects in the system.
12. What is Horizontal scaling vs Vertical scaling in Snowflake.
Snowflake Enterprise and the above versions support a multi-cluster warehouse where you
can create a multi-cluster environment to handle scalability. The warehouse can be scaled
horizontally or vertically.
The multi-cluster warehouse can be configured in two ways:
1. Maximized mode: the minimum and maximum number of clusters are set to the same value
(greater than 1, up to 10), so all clusters run whenever the warehouse is running.
2. Auto-Scale mode: the minimum and maximum number of clusters are set to different values
(for example, min = 2 and max = 10), and Snowflake starts or stops clusters based on load.
You can manually change your warehouse according to your query structure and complexity.
Below are the scaling methods available in snowflake.
Vertical scaling :
scaling up: Increasing the size of the warehouse (small to medium)
scaling down: decreasing the size of the warehouse (medium to small)
Horizontal scaling:
Scaling in: removing clusters from the multi-cluster warehouse (e.g., 4 → 2)
Scaling out: adding more clusters to the multi-cluster warehouse (e.g., 2 → 4)
12. Where is metadata stored in Snowflake?
Once a table is created in Snowflake, Snowflake generates metadata about the table, including the
row count, the creation timestamp, and column statistics such as MIN and MAX values for
numerical columns. Metadata is stored and managed by Snowflake's cloud services layer, which
is why queries that can be answered from metadata alone do not require a running warehouse.
13. Briefly explain the different data security features that are available in
Snowflake
Multiple data security options are available in snowflake such as :
1. Secure view
2. Reader account
3. Shared data
4. RBAC
14. What are the responsibilities of a storage layer in Snowflake?
The storage layer is nothing but the cloud storage service where data resides.
It has responsibilities such as :
1. Data protection
2. Data durability
3. Data Encryption
4. Archival of Data
15. Is Snowflake an MPP database
Yes. MPP means Massively Parallel Processing. Snowflake is built on the cloud, so it inherits
cloud characteristics such as scalability and can handle concurrent queries by adding compute
resources as needed. Snowflake uses a hybrid architecture: a central data repository accessible
from all compute nodes (similar to shared-disk) combined with MPP virtual warehouses, where
each node in a cluster processes a portion of the data locally (shared-nothing). When query load
increases, multi-cluster warehouses can automatically spin up additional clusters to handle the
concurrency.
16. Explain the different table Types available in Snowflake:
It supports three types of tables :
1. Permanent:
Permanent tables are the default table type created in Snowflake. They occupy cloud storage, and
the data stored in a permanent table is partitioned into micro-partitions for better data retrieval.
This table type has the strongest data protection features: Time Travel (1 day by default,
extendable up to 90 days on Enterprise Edition and above) and Fail-safe.
2. Temporary: Temporary tables exist only for the duration of the session in which they were
created and are not visible to other sessions; their data is automatically dropped when the session
ends. They still consume storage while they exist.
3. Transient: Transient tables are similar to temporary tables with respect to the Time Travel
period (at most 1 day) and the lack of Fail-safe, but they persist across sessions and must be
dropped explicitly.
17. Explain the differences and similarities between Transient and Temporary
tables

18. Which Snowflake edition should you use if you want to enable time travel for
up to 90 days :
The Standard edition supports the time travel period of up to 1 day. For time travel of more
than 1 day for the permanent table, we need to get a Snowflake edition higher than standard.
All snowflake editions support only one day of time travel by default.
19. What are Micro-partitions :
Snowflake has its own way of storing data in cloud storage. Snowflake is a columnar data
warehouse: instead of storing data row-wise, it splits a table into columnar chunks called
micro-partitions. They are called "micro" because each partition is limited to roughly 50 to 500
MB of uncompressed data.
Snowflake does not support traditional indexes; instead, it maintains metadata about each
micro-partition to retrieve data faster. A relational database typically uses indexes, or scans all
rows, to find the requested data, and the overhead of reading unused data makes retrieval slow
and compute-heavy. In contrast, Snowflake uses the micro-partition metadata (such as the range
of values and the number of rows in each partition) to determine which micro-partitions actually
contain the data requested by the user, and reads only those.
Check the Snowflake docs on micro-partitions for more detail.
20. By default, clustering keys are created for every table, how can you disable
this option
As new data continuously arrives and is loaded into micro-partitions, some columns (for
example, event_date) have constant values within each partition (naturally clustered), while
other columns (for example, city) may have the same values appearing over and over across all
partitions.
Snowflake allows you to define clustering keys, one or more columns that are used to co-
locate the data in the table in the same micro-partitions.
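A clustering key is defined (or changed) with ALTER TABLE; a minimal sketch, where t1 and
event_date are placeholder names:
alter table t1 cluster by (event_date);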
To suspend Automatic Clustering for a table, use the ALTER TABLE command with
a SUSPEND RECLUSTER clause. For example:
alter table t1 suspend recluster;
To resume Automatic Clustering for a clustered table, use the ALTER TABLE command with
a RESUME RECLUSTER clause. For example:
alter table t1 resume recluster;
21. What is the default type of table created in the Snowflake.
In addition to permanent tables, which is the default table type when creating tables,
Snowflake supports defining tables as either temporary or transient. These types of tables are
especially useful for storing data that does not need to be maintained for extended periods of
time (i.e. transitory data).
22. How many servers are present in X-Large Warehouse
23. Since Snowflake runs on one of the cloud providers (like AWS or Azure) as part of its
architecture, why can't the AWS database Amazon Redshift be used instead of the Snowflake
warehouse?
24. What view types can be created in Snowflake but not in traditional
databases:
Like tables, there are different types of views that can be created in Snowflake, i.e. normal
views, secure views, and materialized views.
Normal views are similar to the views found in an RDBMS: the output depends on the query the
view runs against one or more tables, and that query is executed every time the view is queried,
so the results always reflect the current data.
Secure Views prevent users from possibly being exposed to data from rows of tables that are
filtered by the view. With secure Views, the view definition and details are only visible to
authorized users (i.e. users who are granted the role that owns the View).
A materialized view is a pre-computed dataset derived from a query specification which is
nothing but a SELECT query in its definition. The output is stored for later use.
Since the underlying data of the given query is pre-computed, querying a materialized view is
faster than executing the original query. This performance difference can be significant when
a query is run frequently or it is too complex.
25. Is Snowflake a Data Lake
A data lake is normally used for landing all kinds of data coming from various data sources; it
can contain text, chat logs, files, images, or videos. The data is typically unfiltered and
unorganized, and it is difficult to analyze or extract useful information from it without further
processing.
Snowflake, in turn, supports structured and semi-structured data on scalable cloud storage,
providing data lake-like capabilities along with analytical usage of the data. By choosing
Snowflake, you get many of the benefits of both a data lake and a data warehouse.
26. What are the key benefits you have noticed after migrating to Snowflake
from a traditional on-premise database.
1. Cloud agnostic.
2. Decoupled storage and compute.
3. Highly scalable.
4. Query performance.
5. Supports structured and semi-structured data.
6. Native connectors such as Python, Scala, R, and JDBC/ODBC.
7. Secure data sharing.
8. Materialized views.
27. When you execute a query, how does Snowflake retrieve the data compared to
traditional databases?
1. When an end user executes a query, it first goes to the cloud services layer, where it is
parsed, optimized, and restructured for better performance; the query compiler in the same
layer decides how data should be fetched from the underlying storage.
2. After compilation, Snowflake checks the metadata and result caches to see whether the
query (or part of it) can be answered from cached data before scanning storage.
28. Explain the difference between External Stages and Internal Named Stages:
Stages denote where you stage (hold) data for Snowflake.
There are two types of stages in Snowflake:
1. Internal stage:
Snowflake provides a place to hold the data within Snowflake itself; data never leaves the
Snowflake deployment in this kind of stage.
Internal stages are further divided into sub-categories:
1. User: each user automatically gets an allocated stage for data loading.
2. Table: each table automatically gets an allocated stage for data loading.
3. Named: named stages can be created manually for data loading.
2. External stage:
In contrast to internal stages, external stages point to locations outside Snowflake, i.e. cloud
storage buckets (S3, GCS, Azure Blob).
You must specify an internal stage in the PUT command when uploading files to Snowflake.
You must specify the same stage in the COPY INTO <table> command when loading data
into a table from the staged files.
29. Explain the difference between User and Table Stages.
User stages:
Each user has a Snowflake stage allocated to them by default for storing files. This stage is a
convenient option if your files will only be accessed by a single user, but need to be copied into
multiple tables.
User stages have the following characteristics and limitations:
 User stages are referenced using @~; e.g. use LIST @~ to list the files in a user
stage.
 Unlike named stages, user stages cannot be altered or dropped.
 User stages do not support setting file format options. Instead, you must specify
file format and copy options as part of the COPY INTO <table> command.
This option is not appropriate if:
 Multiple users require access to the files.
 The current user does not have INSERT privileges on the tables the data will be
loaded into.
Table stage:
Each table has a Snowflake stage allocated to it by default for storing files. This stage is a
convenient option if your files need to be accessible to multiple users and only need to be
copied into a single table.
Table stages have the following characteristics and limitations:
 Table stages have the same name as the table; e.g. a table named mytable has a
stage referenced as @%mytable.
 Unlike named stages, table stages cannot be altered or dropped.
 Table stages do not support transforming data while loading it (i.e. using a query
as the source for the COPY command).
Note that a table stage is not a separate database object; rather, it is an implicit stage tied to
the table itself. A table stage has no grantable privileges of its own. To stage files to a table
stage, list the files, query them on the stage, or drop them, you must be the table owner (have
the role with the OWNERSHIP privilege on the table).
30. What are the constraints which are enforced in Snowflake?
Normally, constraints are not enforced in Snowflake, except for NOT NULL constraints, which
are always enforced.
Usually, in traditional databases, there are many constraints being used to validate or restrict
the incorrect data from being stored such as primary key, not null, Unique, etc.
Snowflake provides the following constraint functionality:
 Unique, primary, and foreign keys, and NOT NULL columns.
 Named constraints.
 Single-column and multi-column constraints.
 Creation of constraints inline and out-of-line.
 Support for creation, modification and deletion of constraints.
31. What is unique about Snowflake Vs Other Warehouses.
Please refer to question Q2.
32. How does Snowflake charge the customer?
Snowflake charges on a pay-as-you-go basis: compute is billed per second (with a 60-second
minimum) in Snowflake credits, and storage is billed per terabyte per month.
34. Do DDL commands cost you?
36. What is the difference between Snowflake and other databases?
37. How will you calculate the cost of a query running in Snowflake?
38. How do you load files into Snowflake?
39. How do you share a table in Snowflake other than through the data marketplace?
40. How does Snowflake store data?
41. If I face an error while loading data, what will happen?
42. What is Snowpipe?
43. What is a materialized view, and what are its drawbacks?
44. How can you implement CDC in Snowflake?
45. If one of the source tables adds a few more columns, how will you handle it on the
Snowflake end?
46. How do you load data from JSON into Snowflake?
47. What are secure views and why are they used? How is data privacy handled here?
48. What is a materialized view?
49. What are streams?
50. How can you fetch specific data from VARIANT columns?
51. How do you load semi-structured data in Snowflake?
52. How do you create a stage in Snowflake?
53. What is clustering?
54. What is automatic clustering?
55. If I want to fetch data on the basis of a timestamp value, is it feasible to cluster the data on
the timestamp column?
56. How would you read hierarchical JSON data, for example when it contains an array?
57. How do you disable Fail-safe?
58. What is the best approach to recover, as quickly as possible, historical data that was
accidentally deleted?
59. You have created a warehouse using the command create or replace warehouse
OriginalWH initially_suspended=true; What will be the size of the warehouse?
Scenario-Based Questions:
1. You have observed that a stored procedure executed daily at 7 AM as part of your batch
process is consuming resources, CPU I/O is showing 90%, and the other jobs running at the same
time are impacted by it. How can you quickly resolve the issue with the stored procedure?
2. Some queries are executing on a warehouse and you run an ALTER WAREHOUSE statement
to resize it. How will this affect the queries that are already in the execution state?
3. A new business analyst has joined your project, as part of the onboarding process you have
sent him some queries to generate some reports, the query took around 5 minutes to get
executed, the same query, when executed by other business analysts, has returned the results
immediately? What could be the Issue?
Data Build Tool (DBT) Interview Questions and Answers

By nishad patkar · Dec 21, 2023
Data Build Tool (DBT) is a popular open-source tool used in the data analytics and data
engineering fields. DBT helps data professionals transform, model, and prepare data for
analysis. If you’re preparing for an interview related to DBT, it’s important to be well-versed
in its concepts and functionalities. To help you prepare, here’s a list of common interview
questions and answers about DBT.
1. What is DBT?

Answer: DBT, short for Data Build Tool, is an open-source data transformation and modeling
tool. It helps analysts and data engineers manage the transformation and preparation of data
for analytics and reporting.
2. What are the primary use cases of DBT?
Answer: DBT is primarily used for data transformation, modeling, and preparing data for
analysis and reporting. It is commonly used in data warehouses to create and maintain data
pipelines.
3. How does DBT differ from traditional ETL tools?
Answer: Unlike traditional ETL tools, DBT focuses on transforming and modeling data within
the data warehouse itself, making it more suitable for ELT (Extract, Load, Transform)
workflows. DBT leverages the power and scalability of modern data warehouses and allows
for version control and testing of data models.
4. What is a DBT model?
Answer: A DBT model is a SQL file that defines a transformation or a table within the data
warehouse. Models can be simple SQL queries or complex transformations that create derived
datasets.
5. Explain the difference between source and model in DBT.
Answer: A source in DBT refers to the raw or untransformed data that is ingested into the
data warehouse. Models are the transformed and structured datasets created using DBT to
support analytics.
6. What is a DBT project?
Answer: A DBT project is a directory containing all the files and configurations necessary to
define data models, tests, and documentation. It is the primary unit of organization for DBT.
7. What is a DAG in the context of DBT?
Answer: DAG stands for Directed Acyclic Graph, and in the context of DBT, it represents the
dependencies between models. DBT uses a DAG to determine the order in which models are
built.
8. How do you write a DBT model to transform data?
Answer: To write a DBT model, you create a `.sql` file in the appropriate project directory,
defining the SQL transformation necessary to generate the target dataset.
9. What are DBT macros, and how are they useful in transformations?
Answer: DBT macros are reusable SQL code snippets that can simplify and standardize
common operations in your DBT models, such as filtering, aggregating, or renaming columns.
10. How can you perform testing and validation of DBT models?
Answer: You can perform testing in DBT by writing custom SQL tests to validate your data
models. These tests can check for data quality, consistency, and other criteria to ensure your
models are correct.
11. Explain the process of deploying DBT models to production.
Answer: Deploying DBT models to production typically involves using DBT Cloud, CI/CD
pipelines, or other orchestration tools. You’ll need to compile and build the models and then
deploy them to your data warehouse environment.
12. How does DBT support version control and collaboration?
Answer: DBT integrates with version control systems like Git, allowing teams to collaborate
on DBT projects and track changes to models over time. It provides a clear history of changes
and enables collaboration in a multi-user environment.
13. What are some common performance optimization techniques for DBT models?
Answer: Performance optimization in DBT can be achieved by using techniques like
materialized views, optimizing SQL queries, and using caching to reduce query execution
times.
14. How do you monitor and troubleshoot issues in DBT?
Answer: DBT provides logs and diagnostics to help monitor and troubleshoot issues. You can
also use data warehouse-specific monitoring tools to identify and address performance
problems.
15. Can DBT work with different data sources and data warehouses?
Answer: Yes, DBT supports integration with a variety of data sources and data warehouses,
including Snowflake, BigQuery, Redshift, and more. It’s adaptable to different cloud and on-
premises environments.
16. How does DBT handle incremental loading of data from source systems?
Answer: DBT can handle incremental loading by using source freshness checks and managing
data updates from source systems. It can be configured to only transform new or changed
data.
17. What security measures does DBT support for data access and transformation?
Answer: DBT supports the security features provided by your data warehouse, such as row-
level security and access control policies. It’s important to implement proper access controls
at the database level.
18. How can you manage sensitive data in DBT models?
Answer: Sensitive data in DBT models should be handled according to your organization’s
data security policies. This can involve encryption, tokenization, or other data protection
measures.
19. Types of Materialization?
Answer: DBT supports several types of materialization are as follows:
1)View (Default):
Purpose: Views are virtual tables that are not materialized. They are essentially saved
queries that are executed at runtime.
Use Case: Useful for simple transformations or when you want to reference a SQL query in
multiple models.

{{ config(
materialized='view'
) }}
SELECT
...
FROM ...

2)Table:
Purpose: Materializes the result of a SQL query as a physical table in your data warehouse.
Use Case: Suitable for intermediate or final tables that you want to persist in your data
warehouse.
{{ config(
materialized='table'
) }}
SELECT
...
INTO {{ ref('my_table') }}
FROM ...

3)Incremental:
Purpose: Materializes the result of a SQL query as a physical table, but is designed to be
updated incrementally. It’s typically used for incremental data loads.
Use Case: Ideal for situations where you want to update your table with only the new or
changed data since the last run.

{{ config(
materialized='incremental'
) }}
SELECT
...
FROM ...

4)Table + Unique Key:


Purpose: Similar to the incremental materialization, but specifies a unique key that dbt can
use to identify new or updated rows.
Use Case: Useful when dbt needs a way to identify changes in the data.

{{ config(
materialized='table',
unique_key='id'
) }}
SELECT
...
INTO {{ ref('my_table') }}
FROM ...

5)Snapshot:
Purpose: Materializes a table in a way that retains a version history of the data, allowing you
to query the data as it was at different points in time.
Use Case: Useful for slowly changing dimensions or situations where historical data is
important.
{{ config(
materialized='snapshot'
) }}
SELECT
...
INTO {{ ref('my_snapshot_table') }}
FROM ...

20. Types of Tests in DBT?


Answer: Dbt provides several types of tests that you can use to validate your data. Here are
some common test types in dbt:
1)Unique Key Test (unique):
Verifies that a specified column or set of columns contains unique values.

version: 2

models:
- name: my_model
tests:
- unique:
columns: [id]

2)Not Null Test (not_null):


Ensures that specified columns do not contain null values.

version: 2

models:
- name: my_model
tests:
- not_null:
columns: [name, age]

3)Accepted Values Test (accepted_values):


Validates that the values in a column are among a specified list.

version: 2

models:
- name: my_model
tests:
- accepted_values:
column: status
values: ['active', 'inactive']

4)Relationship Test (relationship):


Verifies that the values in a foreign key column match primary key values in the referenced
table.

version: 2

models:
- name: orders
tests:
- relationship:
to: ref('customers')
field: customer_id

5)Referential Integrity Test (referential integrity):


Checks that foreign key relationships are maintained between two tables.

version: 2

models:
- name: orders
tests:
- referential_integrity:
to: ref('customers')
field: customer_id

6)Custom SQL Test (custom_sql):


Allows you to define custom SQL expressions to test specific conditions.

version: 2

models:
- name: my_model
tests:
- custom_sql: "column_name > 0"

21.What is seed?
Answer: A “seed” refers to a type of dbt model that represents a table or view containing static
or reference data. Seeds are typically used to store data that doesn’t change often and doesn’t
require transformation during the ETL (Extract, Transform, Load) process.
Here are some key points about seeds in dbt:
1. Static Data: Seeds are used for static or reference data that doesn’t change
frequently. Examples include lookup tables, reference data, or any data that
serves as a fixed input for analysis.
2. Initial Data Load: Seeds are often used to load initial data into a data
warehouse or data mart. This data is typically loaded once and then used as a
stable reference for reporting and analysis.
3. YAML Configuration: In dbt, a seed is defined in a YAML file where you
specify the source of the data and the destination table or view in your data
warehouse. The YAML file also includes configurations for how the data should
be loaded.
Here’s an example of a dbt seed YAML file:

version: 2

sources:
- name: my_seed_data
tables:
- name: my_seed_table
seed:
freshness: { warn_after: '7 days', error_after: '14 days' }

22.What is Pre-hook and Post-hook?


Answer: Pre-hooks and Post-hooks are mechanisms to execute SQL commands or scripts
before and after the execution of dbt models, respectively. dbt is an open-source tool that
enables analytics engineers to transform data in their warehouse more effectively.
Here’s a brief explanation of pre-hooks and post-hooks:
1)Pre-hooks:
 A pre-hook is a SQL command or script that is executed before running dbt
models.
 It allows you to perform setup tasks or run additional SQL commands before the
main dbt modeling process.
 Common use cases for pre-hooks include tasks such as creating temporary tables,
loading data into staging tables, or performing any other necessary setup before
model execution.
Example of a pre-hook :

-- models/my_model.sql
{{ config(
pre_hook = "CREATE TEMP TABLE my_temp_table AS SELECT * FROM
my_source_table"
) }}
SELECT
column1,
column2
FROM
my_temp_table

2)Post-hooks:
 A post-hook is a SQL command or script that is executed after the successful
completion of dbt models.
 It allows you to perform cleanup tasks, log information, or execute additional
SQL commands after the models have been successfully executed.
 Common use cases for post-hooks include tasks such as updating metadata
tables, logging information about the run, or deleting temporary tables created
during the pre-hook.
Example of a post-hook :

-- models/my_model.sql
SELECT
column1,
column2
FROM
my_source_table

{{ config(
post_hook = "UPDATE metadata_table SET last_run_timestamp =
CURRENT_TIMESTAMP"
) }}

23.what is snapshots?
Answer: “snapshots” refer to a type of dbt model that is used to track changes over time in a
table or view. Snapshots are particularly useful for building historical reporting or analytics,
where you want to analyze how data has changed over different points in time.
Here’s how snapshots work in dbt:
1. Snapshot Tables: A snapshot table is a table that represents a historical state of
another table. For example, if you have a table representing customer
information, a snapshot table could be used to capture changes to that
information over time.
2. Unique Identifiers: To track changes over time, dbt relies on unique identifiers
(primary keys) in the underlying data. These identifiers are used to determine
which rows have changed, and dbt creates new records in the snapshot table
accordingly.
3. Timestamps: Snapshots also use timestamp columns to determine when each
historical version of a record was valid. This allows you to query the data as it
existed at a specific point in time.
4. Configuring Snapshots: In dbt, you configure snapshots in your project by
creating a separate SQL file for each snapshot table. This file defines the base
table or view you’re snapshotting, the primary key, and any other necessary
configurations.
Here’s a simplified example:

-- snapshots/customer_snapshot.sql

{{ config(
materialized='snapshot',
unique_key='customer_id',
target_database='analytics',
target_schema='snapshots',
strategy='timestamp'
) }}

SELECT
customer_id,
name,
email,
address,
current_timestamp() as snapshot_timestamp
FROM
source.customer;

24.What is macros?
Answer: macros refer to reusable blocks of SQL code that can be defined and invoked within
dbt models. dbt macros are similar to functions or procedures in other programming
languages, allowing you to encapsulate and reuse SQL logic across multiple queries.
Here’s how dbt macros work:
1. Definition: A macro is defined in a separate file with a .sql extension. It
contains SQL code that can take parameters, making it flexible and reusable.

-- my_macro.sql
{% macro my_macro(parameter1, parameter2) %}
SELECT
column1,
column2
FROM
my_table
WHERE
condition1 = {{ parameter1 }}
AND condition2 = {{ parameter2 }}
{% endmacro %}

2. Invocation: You can then use the macro in your dbt models by referencing it.

-- my_model.sql
{{ my_macro(parameter1=1, parameter2='value') }}

When you run the dbt project, dbt replaces the macro invocation with the actual SQL code
defined in the macro.
3. Parameters: Macros can accept parameters, making them dynamic and reusable for
different scenarios. In the example above, parameter1 and parameter2 are parameters
that can be supplied when invoking the macro.
4. Code Organization: Macros help in organizing and modularizing your SQL code. They
are particularly useful when you have common patterns or calculations that need to be
repeated across multiple models.

-- my_model.sql
{{ my_macro(parameter1=1, parameter2='value') }}

-- another_model.sql
{{ my_macro(parameter1=2, parameter2='another_value') }}

25.what is project structure?


Answer: Aproject structure refers to the organization and layout of files and directories within
a dbt project. dbt is a command-line tool that enables data analysts and engineers to
transform data in their warehouse more effectively. The project structure in dbt is designed to
be modular and organized, allowing users to manage and version control their analytics code
easily.
A typical dbt project structure includes the following key components:
1. Models Directory:
This is where you store your SQL files containing dbt models. Each model represents a logical
transformation or aggregation of your raw data. Models are defined using SQL syntax and are
typically organized into subdirectories based on the data source or business logic.
2. Data Directory:
The data directory is used to store any data files that are required for your dbt
transformations. This might include lookup tables, reference data, or any other supplemental
data needed for your analytics.
3. Analysis Directory:
This directory contains SQL files that are used for ad-hoc querying or exploratory analysis.
These files are separate from the main models and are not intended to be part of the core data
transformation process.
4. Tests Directory:
dbt allows you to write tests to ensure the quality of your data transformations.
The tests directory is where you store YAML files defining the tests for your models. Tests
can include checks on the data types, uniqueness, and other criteria.
5. Snapshots Directory:
Snapshots are used for slowly changing dimensions or historical tracking of data changes.
The snapshots directory is where you store SQL files defining the logic for these snapshots.
6. Macros Directory:
Macros in dbt are reusable pieces of SQL code. The macros directory is where you store these
macros, and they can be included in your models for better modularity and maintainability.
7. Docs Directory:
This directory is used for storing documentation for your dbt project. Documentation is
crucial for understanding the purpose and logic behind each model and transformation.
8. dbt_project.yml:
This YAML file is the configuration file for your dbt project. It includes settings such as the
target warehouse, database connection details, and other project-specific configurations.
9. Profiles.yml:
This file contains the connection details for your data warehouse. It specifies how to connect
to your database, including the type of database, host, username, and password.
10. Analysis and Custom Folders:
You may have additional directories for custom scripts, notebooks, or other artifacts related
to your analytics workflow.
Having a well-organized project structure makes it easier to collaborate with team members,
maintain code, and manage version control. It also ensures that your analytics code is
modular, reusable, and easy to understand.

my_project/
|-- analysis/
| |-- my_analysis_file.sql
|-- data/
| |-- my_model_file.sql
|-- macros/
| |-- my_macro_file.sql
|-- models/
| |-- my_model_file.sql
|-- snapshots/
| |-- my_snapshot_file.sql
|-- tests/
| |-- my_test_file.sql
|-- dbt_project.yml

26. What is data refresh?


Answer: “data refresh” typically refers to the process of updating or reloading data in your
data warehouse. Dbt is a command-line tool that enables data analysts and engineers to
transform data in their warehouse more effectively. It allows you to write modular SQL
queries, called models, that define transformations on your raw data.
Here’s a brief overview of the typical workflow involving data refresh in dbt:
1. Write Models: Analysts write SQL queries to transform raw data into analysis-
ready tables. These queries are defined in dbt models.
2. Run dbt: Analysts run dbt to execute the SQL queries and create or update the
tables in the data warehouse. This process is often referred to as a dbt run.
3. Data Refresh: After the initial run, you may need to refresh your data regularly
to keep it up to date. This involves re-running dbt on a schedule or as needed to
reflect changes in the source data.
4. Incremental Models: To optimize performance, dbt allows you to write
incremental models. These models only transform and refresh the data that has
changed since the last run, rather than reprocessing the entire dataset. This is
particularly useful for large datasets where a full refresh may be time-consuming.
5. Dependency Management: Dbt also handles dependency management. If a
model depends on another model, dbt ensures that the dependencies are run
first, maintaining a proper order of execution.
By using dbt for data refresh, you can streamline and automate the process of transforming
raw data into a clean, structured format for analysis. This approach promotes repeatability,
maintainability, and collaboration in the data transformation process.

10

Written by nishad patkar

3 Followers
Follow
More from nishad patkar

nishad patkar

How Cloud Formation Works?


AWS CloudFormation is a service provided by Amazon Web Services (AWS)
that allows you to define and provision infrastructure as code. With…

Questions to Ask in Real-Time During a Snowflake Interview.


Amulya Kumar panda
·
Follow
5 min read
·
Aug 17, 2023

21

It is important to provide clear and concise responses during job interviews with any
company.
What is a snowflake?
Snowflake is an analytical data warehouse that operates on the cloud and offers software as a
service (SaaS).
Why is Snowflake, not any other warehouse? Or What is the advanced feature of
Snowflake?
Below are fetchers available in Snowflake.
(1) Snowflake three-layer architected(cloud layer, query processing layer, storage
layer).storage and compute layer is decoupled(2)Auto scaling (3)Time travel (4) Zero copy
clone (5)Data sharing (6)Multi-language support(7) Task and strim (8) snow pipe
(9)snowpark, etc.
What type of user role is available in Snowflake?
Below are six roles available in Snowflake.
Account admin -The account admin can manage all aspects of the account.
Orgadmin- The organization administrator can manage the organization and account in the
organization.
Public -Public role is automatically available to every user in an account.
Securityadmin-Security administrator can manage the security aspects of the account.
Sysadmin-System administrator can create and manage database and warehouse.
User admin-user administrator can create and manage user and role
How to validate a file prior to loading it into a target table in Snowflake?
Before loading your data, you can validate that the data in the uploaded files will load
correctly. Execute COPY INTO <table> in validation mode
using the VALIDATION_MODE parameter. The VALIDATION_MODE parameter returns
any errors that it encounters in a file.
You can then modify the data in the file to ensure it loads without error.
Type of table support Snowflake?
Snowflake offers three types of tables namely, Temporary, Transient & Permanent. Default is
Permanent:
Temporary tables:
Only exist within the session in which they were created and persist only for the remainder of
the session.
They are not visible to other users or sessions and do not support some standard features
such as cloning.
Once the session ends, data stored in the table is purged completely from the system and,
therefore, is not recoverable, either by the user who created the table or Snowflake.
Transient tables:
Persist until explicitly dropped and are available to all users with the appropriate privileges.
Specifically designed for transitory data that needs to be maintained beyond each session (in
contrast to temporary tables)
Permanent Tables (DEFAULT):
Similar to transient tables the key difference is that they do have a Fail-safe period, which
provides an additional level of data protection and recovery.
What is an External Table in Snowflake?
Snowflake External Tables provide a unique way of accessing the data from files in external
locations(i.e. S3, Azure, or GCS) without actually moving them into Snowflake. They enable
you to query data stored in files in an external stage as if it were inside a database by storing
the file-level metadata.
Type of Snowflake edition?
There are four types of snowflake editions available.
1-Standard Edition
2-Enterprise Edition
3-Business Critical Edition
4-Virtual Private Snowflake (VPS)
What types of stage tables are available in Snowflake?
Snowflake supports two different types of data stages: external stages and internal stages. An
external stage is used to move data from external sources, such as (S3, Azure, or GCS),
buckets, to internal Snowflake tables. On the other hand, an internal stage is used as an
intermediate storage location for data files before they are loaded into a table or after they are
unloaded from a table.
internal stage:-(I)User stage
(II) Table Stage
(III) Named Stage
What is the data retention time in Snowflake?
The standard retention period is 1 day (24 hours) and is automatically enabled for all
Snowflake accounts: For Snowflake Standard Edition, the retention period can be set to 0 (or
unset back to the default of 1 day) at the account and object level (i.e. databases, schemas, and
tables).
Can you explain the concept of a snowflake three-layer architecture?
Database Storage:
When data is loaded into Snowflake, Snowflake reorganizes that data into its internal
optimized, compressed, columnar format. Snowflake stores this optimized data in cloud
storage. Snowflake manages all aspects of how this data is stored — the organization, file size,
structure, compression, metadata, statistics, and other aspects of data storage are handled by
Snowflake.
Query Processing:
Query execution is performed in the processing layer. Snowflake processes queries using
“virtual warehouses”. Each virtual warehouse is an MPP compute cluster composed of
multiple compute nodes allocated by Snowflake from a cloud provider. Each virtual
warehouse is an independent compute cluster that does not share compute resources with
other virtual warehouses. As a result, each virtual warehouse has no impact on the
performance of other virtual warehouses.
Cloud Services:
The cloud services layer is a collection of services that coordinate activities across Snowflake.
Services managed in this layer include:
1-Authentication
2-Infrastructure management
3-Metadata management
4-Query parsing and optimization
5-Access control
What is a clone in Snowflake or what is a zero-copy clone in Snowflake?
The most powerful feature of Zero Copy Cloning is that the cloned and original objects(Table,
schema, database) are independent of each other, any changes done on either of the objects
do not impact others. Until you make any changes, the cloned object shares the same storage
as the original. This can be quite useful for quickly producing backups that don’t cost anything
extra until the copied object is changed.
Can you explain what time travel means in Snowflake?
Snowflake Time Travel enables accessing historical data (i.e. data that has been changed or
deleted) at any point within a defined period.
Can you explain the concept of fail-safe in Snowflake?
Fail-safe protects historical data in case there is a system failure or any other failure. Fail-safe
allows 7 days in which your historical data can be recovered by Snowflake and it begins after
the Time Travel retention period ends. Snowflake support team handles this issue.
How to read data from staging table in JSON file.
We are able to read JSON files from the stage layer using the function lateral
flatten.FLATTEN is a table function that takes a VARIANT, OBJECT, or ARRAY column and
produces a lateral view (i.e. an inline view that contains correlation referring to other tables
that precede it in the FROM clause). FLATTEN can be used to convert semi-structured data to
a relational representation
I will update you shortly on your other question………
Snowflake Cost Optimization Series

Pooja Kelgaonkar
·
Follow
2 min read
·
Oct 18, 2023

12
Hello All! Thanks for reading my earlier Snowflake blog series on Data Governance and
Architecting data workloads with Snowflake. If you havent read the earlier blogs then you can
subscribe to my medium and read earlier blogs in the series.
I am glad to start a new series on Snowflake cost optimization. This series consist of below
topics and help you to learn more about the costing model, optimization techniques etc.
1. Understanding Snowflake Cost
2. Warehouse Cost Optimization
3. Understanding Snowflake Optimization services
4. Implementing Snowflake’s Cost Optimization techniques
5. Implement Cost Monitoring & Alerting using Snowsight
Snowflake’s costing model is very transparent and easy to understand. You will be charged
only for the services being used and only for the time they are up and running. You need to
understand Snowflake’s architecture — three layered architecture and their features to
understand the costing model well. You can refer to my earlier blogs to understand some of
the foundational concepts, features required for the cost series.

Snowflake Architecture
1. Read — https://medium.com/@poojakelgaonkar/snowflake-data-on-cloud-
cf3898fee3d0 to understand Snowflake architecture components and their
features, usage.
2. Read — https://medium.com/snowflake/building-dashboards-using-snowsight-
daacf4bd42a8 to understand Snowsight and dashboarding feature.
3. Read — https://medium.com/@poojakelgaonkar/setting-up-alerting-for-snowflake-data-
platform-8b67863eeb07 to know more about Snowflake’s alerting feature.
This series will help you to understand the Snowflake’s costing model, optimization
techniques, services that can be used to improve the performance optimizations, costing of
serverless services, designing and implementing appropriate cost monitoring and alerts to
avoid any unprecdicted cost to the account.
Please follow my blog series on Snowflake topics to understand various aspects of data design,
engineering, data governance, cost optimizations etc.
About Me :
I am one of the Snowflake Data Superheroes 2023. I am also one of the Snowflake SnowPro
Core SME- Certification Program. I am a DWBI and Cloud Architect! I am currently
working as Senior Data Architect — GCP, Snowflake. I have been working with various
Legacy data warehouses, Bigdata Implementations, and Cloud platforms/Migrations. I am
SnowPro Core certified Data Architect as well as Google certified Google Professional Cloud
Architect. You can reach out to me LinkedIn if you need any further help on certification,
Data Solutions, and Implementations!
Object Tagging with Snowflake

Pooja Kelgaonkar
·
Follow
Published in
Snowflake

·
3 min read
·
Aug 18, 2023

11
1

Thanks for reading my earlier blog in the Data Governance series. In case you missed reading
the blog, you can refer to it here — https://medium.com/snowflake/snowflake-dynamic-data-
masking-4ef7b53b414e
This blog helps you understand tagging in Snowflake. This also covers tagging details — how
you can create them, use them, assign to the database objects, and track them for usage.
What is Tagging?
Tagging is the process to define a tag for a database object. This tag can be used to identify the
object and use it to implement data classification, data protection, compliance, and usage.
This can be used in centralized as well as de-centralized approaches to implementation.
Centralized implementation follows the Snowflake recommendation, Role setup, and Access
control policies setup, follows the organizational hierarchy, and flows down the rules, and
policies to the database objects as per hierarchy.
What is Tag?
A tag is a schema-level object that can be defined and assigned to one or more different types
of objects. You can assign a string value to a Tag. A tag can be assigned to a table, views,
columns as well as a warehouse. Snowflake limits the number of tags in an account to 10,000.
You can assign multiple tags to an object. This is an enterprise feature.
How to create Tag?
You can create a tag using CREATE statement.
CREATE TAG cost_center COMMENT = ‘cost_center tag’;
How to assign Tag to an object?
You can use tags while creating objects or you can also assign them using ALTER command.
CREATE WAREHOUSE DEV_WH WITH TAG (cost_center = 'DEV');
ALTER WAREHOUSE QA_WH SET TAG cost_center = ‘QA’;
How object Tag works? What is the hierarchy of Tags?
Tag follows the Snowflake object hierarchy, if you create a tag at a table level then it also gets
applied to the columns.
What are the benefits of Tag?
Below are some of the benefits —
1. Ease of use — Define once and apply it to multiple objects.
2. Tag Lineage — Since tags are inherited, applying the tag to objects higher in the
securable objects hierarchy results in the tag being applied to all child objects. For
example, if a tag is set on a table, the tag will be inherited by all columns in that
table.
3. Sensitive data tracking — Tags simplify identifying sensitive data (e.g. PII, Secret)
and bring visibility to Snowflake resource usage.
4. Easy Resource Tracking — With data and metadata in the same system, analysts
can quickly determine which resources consume the most Snowflake credits
based on the tag definition (e.g. cost_center, department).
5. Centralized or De-centralized Data Management —
Tags support different management approaches to facilitate compliance with
internal and external regulatory requirements.
How to DISCOVER Tags?
You can use SELECT to list the tags-
SELECT * FROM SNOWFLAKE.ACCOUNT_USAGE.TAGS ORDER BY TAG_NAME;
What all commands can be used for tags?
You can use all DDL commands for tags — CREATE, ALTER, DROP, UNDROP, SHOW. You
can use these to create, alter, drop, or restore as well as list down the tag details from a
database and schema.
Sample Use case to tag PII data —
1. Create a tag to list all sensitive data —
CREATE TAG sensitive_data COMMENT = ‘PII data tag’;
2. Assign a tag to objects to identify PII data —
USE DATABASE POC_DEV_DB;
USE SCHEMA POC;
ALTER TABLE employee_info modify column employee_ssn SET TAG sensitive_data =
‘Employee data’
Hope this blog helps you to understand object tagging. Tagging helps to define the tag to
objects, used to capture usage and data classification.
About Me :
I am one of the Snowflake Data Superheroes 2023. I am also one of the Snowflake SnowPro
Core SME- Certification Program. I am a DWBI and Cloud Architect! I am currently working
as Senior Data Architect — GCP, Snowflake. I have been working with various Legacy data
warehouses, Bigdata Implementations, and Cloud platforms/Migrations. I am SnowPro Core
certified Data Architect as well as Google certified Google Professional Cloud Architect. You
can reach out to me LinkedIn if you need any further help on certification, Data Solutions,
and Implementations!

Snowflake

Snowflake Tables
9 Snowflake Tables- A briefing(as of 2023)

Somen Swain
·
Follow
Published in

Snowflake

·
12 min read
·
Dec 20, 2023

98
1

In this blog I would be discussing about various Snowflake tables and also some of the use
cases which each of these tables can solve. Post reading this blog, I hope it should give some
insights around each of the table type, how these are significant w.r.t multiple data workloads,
where they can be used, etc.
Overall there are 9 different kind of table Snowflake has that are namely given as follows:
1. Dynamic table
2. Directory table
3. Event table
4. External table
5. Hybrid table
6. Iceberg table
7. Permanent table
8. Temporary table
9. Transient table
Let us go though each one of them as mentioned below:
DYNAMIC TABLES
Dynamic tables, are fundamental units of declarative data transformation pipelines are
dynamic tables. They substantially reduce the complexity of data engineering in Snowflake
and offer an automated, dependable, and economical method of preparing your data for use.
A dynamic table allows you to specify a query and has its results materialized. You can define
the target table as a dynamic table and specify the SQL statement that performs the
transformation, saving you the trouble of creating a separate target table and writing code to
transform and update the data in that table. This feature was renamed as Snowflake Dynamic
Tables and is now available for all accounts. It was first introduced as “Materialized
Tables” at the Snowflake Summit 2022, a name that caused some confusion.

The syndax for creating the dynamic table is:

CREATE OR REPLACE DYNAMIC TABLE DEMO_DYNAMIC_TABLE_CUSTOMER


TARGET_LAG = '1 minute'
WAREHOUSE = COMPUTE_WH
AS
SELECT
cst.c_name,
cst_addr.ca_street_number,
cst_addr.ca_street_name,
cst_addr.ca_street_type,
cst_addr.ca_suite_number
FROM
DEMO_DB.DEMO_SCHEMA.DEMO_CUSTOMER_ADDRESS cst_addr
INNER JOIN DEMO_DB.DEMO_SCHEMA.DEMO_CUSTOMER cst
ON cst_addr.ca_address_id=cst.c_address;

Dynamic table components


Benefits and use cases of Dynamic tables:
Let us now go through some of the benefits and use cases of the dynamic tables:
Some use-cases & benefits of dynamic tables
The best applications for Snowflake Dynamic Tables are those that require automated and
straightforward data transformation. When handling massive amounts of data, where manual
transformation would be laborious and prone to errors, they are especially helpful. For
example we want to avoid writing code to avoid data updates/dependencies, avoiding the
need to control the data refresh schedule, etc.
To know more about dynamic table please go through the
documentation https://docs.snowflake.com/en/user-guide/dynamic-tables-about
over here we get insights about cost, roles, managing the dynamic tables, how it can be
pictorially viewed and governed, etc..
DIRECTORY TABLES
A directory table stores file-level metadata about the data files in the stage, and is
conceptually similar to an external table and is an implicit object layered on a stage rather
than a separate database object. There are no grantable privileges inherent to a directory
table.Directory tables support both internal (Snowflake) and external (external cloud storage)
stages. When a stage is created (using CREATE STAGE) or later (using ALTER STAGE), you
have the option to include a directory table in it.
The event notification service for your cloud storage provider can be used to automatically
update the metadata for a directory table. By performing a refresh, the metadata is brought
into alignment with the most recent set of related files in the external stage and path.

/*Let us consider we have multiple files stored in Snowflake internal


stage
Below is how we can see all the metadata details of the file.*/

SELECT * FROM DIRECTORY( @DEMO_INTERNAL_STAGE_01 ); --> This is how


this table is accessed.
/*We can even query the table as given below */
SELECT * FROM DIRECTORY( @DEMO_INTERNAL_STAGE_01 ) where size>2000;

-- The output has the exact file URL where we do have our data stored.

/* Some more syntaxes with directory tables are as follows */

CREATE STAGE mystage


DIRECTORY = (ENABLE = TRUE) -->This is the keyword
FILE_FORMAT = myformat;

CREATE STAGE mystage


URL='s3://load/files/'
STORAGE_INTEGRATION = my_storage_int
DIRECTORY = (ENABLE = TRUE); -->This is the keyword

The o/p from the directory table command


Benefits and use cases of Directory tables:
Let us now go through some of the benefits and use cases of the directory tables:

Some of the benefits and use cases of directory tables


EVENT TABLES
Event tables are the ones which are specifically designed to capture the logs & events all
natively within the platform. Snowflake’s telemetry APIs are what allows to collect and spread
events. The telemetry APIs are supported across various languages for UDFs, UDTFs, and
stored procedures. These are the primitives that can be utilized independently for formulating
questions, as well as being utilized when creating Native Applications.
With event tables currently there are only specific operations that can be done with them
which are mentioned as below:

Event table operations.


If we see over here we cannot directly update or insert the data to the event tables. Let us see
now how they can be created within the platform.

--Step 1(Creating an event table)::


CREATE EVENT TABLE DEMO_DB.DEMO_SCHEMA.EVENT_TBL_V1;

--Step 2(Associate event table with an account)::


ALTER ACCOUNT SET EVENT_TABLE = DEMO_DB.DEMO_SCHEMA.EVENT_TBL_V1;

--Step 3(setting the log level)


ALTER SESSION SET LOG_LEVEL = INFO;

There are multiple ways through which the log level can be set the above example has the
value as “INFO”, we can also have values like ‘trace’, ‘debug’, ‘info’, ‘warn’, ‘error’. Hence
essentially to make the event table. A simple demo of the event table is given as below:

-- This use case is to track just the INSERT operations counts. Like
how many records got inserted

CREATE OR REPLACE PROCEDURE DEMO_EVENT_TBL_PROC(OTPT VARCHAR)


RETURNS VARCHAR NOT NULL
LANGUAGE SQL
EXECUTE AS OWNER
AS
$$
DECLARE
dmlcount integer;
COL varchar;
QUERY STRING;
counts int := 0;
BEGIN
LET START_DATE TIMESTAMP := CURRENT_TIMESTAMP();
SYSTEM$LOG('INFO', 'Procedure started at: ' || TO_CHAR(START_DATE,
'YYYY-MM-DD HH24:mi:ss'));
--- To perform inserts
QUERY := 'INSERT INTO DEMO_DB.DEMO_SCHEMA.BKP_CUSTOMER SELECT * FROM
DEMO_DB.DEMO_SCHEMA.DEMO_CUSTOMER LIMIT 10';
EXECUTE IMMEDIATE :QUERY;
SELECT $1 INTO counts FROM table(result_scan(last_query_id()));
IF (counts > 0) then
SYSTEM$LOG('INFO', 'Total number of records inserted are: '
|| :counts);
END IF;
LET END_DATE TIMESTAMP := CURRENT_TIMESTAMP();
SYSTEM$LOG('INFO', 'Procedure completed at: ' || TO_CHAR(END_DATE,
'YYYY-MM-DD HH24:mi:ss'));
RETURN OTPT;
END;
$$
;

call DEMO_EVENT_TBL_PROC('A');
select * from DEMO_DB.DEMO_SCHEMA.EVENT_TBL_V1;
SELECT * FROM DEMO_DB.DEMO_SCHEMA.DEMO_CUSTOMER LIMIT 10;
select * from DEMO_DB.DEMO_SCHEMA.BKP_CUSTOMER;

The o/p from the event table


Benefits and use cases of Event tables:
Let us now go through some of the benefits and use cases of the event tables:
Event Table Use Cases
For more information please go through the docs: https://docs.snowflake.com/en/developer-
guide/logging-tracing/tutorials/logging-tracing-getting-started
EXTERNAL TABLES
External tables allows us to query the data from the external stage as if the data is stored
all inside the Snowflake. Over here it is important to not that Snowflake doesn’t store &
manage the external stage hence these are purely managed by customers. External tables let
you store (within Snowflake) certain file-level metadata, including filenames, version
identifiers, and related properties. These are READ ONLY tables.
External tables can also be configured with event notifications(cloud provider like SQS from
AWS), this is to ensure that any activity that happens within external cloud storage is
captured correctly by this table.
We cannot perform any DML operations with these tables, but we can create views on top of
this table and use this table along with other standard table of Snowflake for JOIN operations
to get good insights.
Let us see the below method on how we can create the external table in Snowflake.

--Create the file format.


create or replace file format demo_csv_format
type = 'csv' field_delimiter = ',' skip_header = 1
field_optionally_enclosed_by = '"'
null_if = ('NULL', 'null')
empty_field_as_null = true;

--create the external stage via storage integration object.


create or replace stage demo_db.demo_schema.ext_stage_snow_demo
url="s3://demostorageint/"
STORAGE_INTEGRATION=s3_external_table_storageint ---This is the
storage integration object
file_format = demo_csv_format;
--Create the external tables.
create or replace external TABLE demo_customer_tbl (
CUST_SK varchar AS (value:c1::varchar),
CUST_ID varchar AS (value:c2::varchar)
)
with location=@ext_stage_snow_demo
auto_refresh = false
file_format = (format_name = demo_csv_format)

--Query the external tables.


select * from demo_customer_tbl;

The o/p of the external table.


There is also another column namely “metadata$filename” which stores the information
about the file in the external tables. Hence external tables apart of having details about the
exact data from the file do have metadata information as well.
Metadata columns of “External Table”
Benefits and use cases of External tables:
Let us now go through some of the benefits and use cases of the “external tables” :

Benefits of external tables


Some of the caveats of the “external tables”
Let us now go through some of the caveats of the “external tables” :

Caveats of external tables


More reads on external tables are given as below:

Introduction to External Tables | Snowflake Documentation


An external table is a Snowflake feature that allows you to query data
stored in an external stage as if the data were…
docs.snowflake.com

HYBRID TABLES
Hybrid tables are a new Snowflake table type powering Unistore. A key design principle is
to have this table support all the transactional capabilities need. These are highly performant
which is a need of any transactional application & support fast single row
operations. They work on entirely new row-based storage engine. This is unlike other
tables in Snowflake where data is stored in columnar way. Currently these are still in PrPr
stage within Snowflake but once it is made available for general use it has got immense
potential to unlock multitude of OLTP use cases.

Pic courtesy Snowflake → Hybrid Tables


Hybrid tables is going to enable the new workload of Snowflake i.e., “Unistore” which is a new
workload that delivers a modern approach to working with transactional and analytical
data together in a single platform. Hybrid tables were announced on Snowflake Summit
2022, hence this is also one of the feature which is eagerly awaited by all the data enthusiasts.
Below is how the tables would be defined for Hybrid, please note the use “PRIMARY KEY” &
“FOREIGN KEY” constraints which would be enforced with the use of Hybrid tables.

CREATE HYBRID TABLE Customers (


CustomerKey number(38,0) PRIMARY KEY,
Customername varchar(50)
);

-- Create order table with foreign key referencing the customer table
CREATE OR REPLACE HYBRID TABLE Orders (
Orderkey number(38,0) PRIMARY KEY,
Customerkey number(38,0),
Orderstatus varchar(20),
Totalprice number(38,0),
Orderdate timestamp_ntz,
Clerk varchar(50),
CONSTRAINT fk_o_customerkey FOREIGN KEY (Customerkey) REFERENCES
Customers(Customerkey),
INDEX index_o_orderdate (Orderdate)); -- secondary index to accelerate
time-based lookups

Some of the key features/concepts of this tables are:


1. Storage :: These hybrid tables data is stored as 2 copies one in the row storage &
other in column storage. Hence a trade-off is that it would be a bit costly.
2. Locking concept :: Locking happens at a row level for hybrid tables. This
would enable higher concurrency execution of single row updates.
3. Indexes :: Hybrid table uses B-Tree indexes both for primary key and secondary
indexes. A good example is shown below see how table with secondary index
defined skips the full table scan part from query profile.
4. Compatible with connectors :: Hybrid tables are compatible with all
connectors just like others.
5. Composite key :: There are can be more than 1 column defined as primary key
in hybrid table which would make it as composite key.
6. Scalability :: Hybrid tables would have some size limit and it would be in some
terabytes. This is unlike other tables in Snowflake.
7. Pricing model :: It would be a bit costly considering the data would be stored
in two forms(row and columnar) but as of today the pricing model is not yet
finalized.
Benefits and use cases of Hybrid tables:
Let us now go through some of the benefits and use cases of the “Hybrid tables” :

Benefits & Use cases of Hybrid tables


More reads: https://www.snowflake.com/blog/introducing-unistore/
ICEBERG TABLES
Iceberg tables are the brand new tables which has been made available as public preview
for all accounts only recently. Now, these tables are powered by “Apache iceberg open table
formats” and the idea is to store the data & metadata all within customer managed storage
and being able to use the standard Snowflake features like execute the DMLs, encryption.
These tables has the ability to unlock a lot of “data lake” use cases.
Iceberg tables for Snowflake integrate your own external cloud storage with the query
semantics and performance of standard Snowflake tables. They are perfect for data lakes that
already exist and that you can’t or don’t want to store in Snowflake.
There are 2 important concepts to understand over here namely “Iceberg Catalog” &&
“External Volume”
What is an Iceberg Catalog ?
The Iceberg table specification’s first architectural layer is the Iceberg catalog. It is the
compute engine which can manage and load Iceberg tables. This supports:
 Preserving the pointer to the current metadata for one or more Iceberg tables.
 Updating a table’s current metadata pointer by carrying out atomic operations
Snowflake currently supports 2 Iceberg catalog options namely :
1. Snowflake managed Iceberg Catalog.
2. Externally managed Iceberg Catalog.(via catalog integrations).
The variations between these catalog options are summarized in the table that follows.

Comparison → Courtesy Snowflake


What is an External Volume ?
The external cloud storage identity and access management (IAM) entity is stored in
an external volume, which is a named, account-level Snowflake object. In order to access
table data, Iceberg metadata, and manifest files containing the table schema, partitions, and
other metadata, Snowflake safely connects to your cloud storage via an external volume.
Iceberg table creation.
We need to designate an external volume and a base location (directory on the external
volume) where Snowflake can write table data and metadata in order to create an Iceberg
table with Snowflake acting as the catalog.
CREATE ICEBERG TABLE myTable
CATALOG='SNOWFLAKE'
EXTERNAL_VOLUME='myIcebergVolume'
BASE_LOCATION='relative/path/from/extvol/location/';

External volume configuration steps can be seen here: https://docs.snowflake.com/en/user-


guide/tables-iceberg-configure-external-volume#label-tables-iceberg-configure-external-
volume-s3
Benefits/Key points/Use cases of Iceberg tables:
Let us now go through some of the benefits and use cases of the “Iceberg tables” :

More reads:
https://www.snowflake.com/blog/build-open-data-lakehouse-iceberg-tables/
https://docs.snowflake.com/en/user-guide/tables-iceberg
PERMANENT, TEMPORARY & TRANSIENT TABLES
Lastly, let us throw some insights around most widely used tables i.e., Permanent, Temporary
& Transient tables. These tables have been there since years now and hence is used by almost
every other individual who has been associated with Snowflake.
What is Permanent table?
The typical, everyday database tables are the “Permanent Tables”. Snowflake’s default
table type is permanent, and making one doesn’t require any extra syntax during creation.
The information kept in permanent tables takes up space and gets added to the storage fees
that Snowflake charges you.
In addition, it has extra features like Fail-Safe and Time-Travel that aid in data availability
and recovery.
--The syntax for creating the permanent table is given as below:
create table student (id number, name varchar(100));

What is Transient table?


With the exception of having a very short Time-Travel period and no Fail-safe
period, Transient tables in Snowflake are comparable to permanent tables. These work
best in situations where the information in your table is not urgent and can be retrieved
through other channels if necessary.
Like permanent tables, transient tables add to the overall storage costs associated with your
account. Nevertheless, there are no fail-safe costs (i.e., the costs related to maintaining
the data required for fail-safe disaster recovery) since transient tables do not use fail-safe.

create transient table student (id number, name varchar(100));

What is Temporary table?


In Snowflake, temporary tables are only available during the duration of the
session in which they were created. Other users or sessions cannot see them. The data in the
table is fully deleted and irretrievably lost when the session ends.
Temporary tables, like transient tables, have a very short Time-Travel period and no Fail-safe
period. These work best for holding temporary, non-permanent data that is only needed for
the duration of the creation session.

create temporary table student (id number, name varchar(100));

The Comparison
The differences between the three table types are outlined in the table below, with special
attention to how they affect fail-safe and time travel:

The comparison of the Permanent/Temporary & Transient table


Benefits and use cases of Permanent/Temporary/Transient tables:
Let us now go through some of the benefits and use cases of the “external tables” :

SUMMARY:
There are a number of Snowflake tables and this blog is written to give insights around each
one of them and what each table does. Using the right kind of table for each of the scenarios
we handle shall bring the best out of this platform.
Please keep reading my blogs it is only going to encourage me in posting more
such content. You can find me on LinkedIn by clicking here and on Medium here.
Happy Learning :)
Awarded as “Data Superhero by Snowflake for year 2023”, click here for more
details.
Disclaimer: The views expressed here are mine alone and do not necessarily reflect the view
of my current, former, or future employers.

Data And Analytics

Snowflake

Snowflake Data Cloud

A Simple Parameterized Query in Snowflake Scripting? Not So Easy :(

Cristian Scutaru
·
Follow
8 min read
·
Nov 10, 2023

26
1

I remember last February, when they announced the availability of Snowflake Scripting.
Oracle had PL/SQL, Microsoft SQL Server had Transact-SQL, it was about time for Snowflake
to come up with their own procedural SQL language. All in the general effort of bringing data
processing closer to where the data is. And making this data processing less verbose and more
intuitive.
It was very exciting for me, as I just had to design and implement — for a large client — a huge
generic Snowflake data pipeline with …heavy stored procedures in JavaScript (the only
available alternative at that time).
However, this is the first time since that I had a few hours for a rather deep dive. And, despite
the obvious utility of this API, …
…I have strong doubts at this moment that Snowflake Scripting is mature
enough for production.
Simple Requirements
My need was for a complex generic query encapsulated in a secure tabular function, to be
exposed through a secure data share. To keep it simple here, let’s illustrate this with a very
simple UNION query, that basically duplicates the rows of any table, when you pass the table
name as argument. Here is my very simple test data:
create or replace database script_test;
use schema script_test.public;

-- create a simple table of customers, w/ two columns and two rows


create table customers(name varchar, age int);
insert into customers
values ('John', 22), ('Mary', 35);

-- create a simple table of products, w/ three columns and three rows


create table products(id int, name varchar, quantity int);
insert into products
values (23, 'batteries', 100), (11, 'pens', 225), (151, 'notebooks',
14);

What I want is a simple UDTF to return duplicate rows on ANY table. This is the hard-coded
version for just the CUSTOMERS table:

create or replace function dup_customers()


returns table(name varchar, age int)
language sql
as 'select * from customers
union all
select * from customers';

select * from table(dup_customers());

It works, but if I try to return a generic TABLE() type instead, I get:


Mismatch between declared return signature column count (0) and actual column count (2)
I cannot let the function infer the returned columns and their types at runtime, as the doc
implies: “Otherwise (e.g. if you are determining the column types during run
time), you can omit the column names and types”.
It looks like whenever a code block can infer the returned table columns at compile time, you
also have to declare them in the RETURNS TABLE clause. Isn’t this redundant?…
UDTF in SQL? Not working.
To make it generic, I would need something like:

create or replace function dup_any(table_name varchar)


returns table()
language sql
as
EXECUTE IMMEDIATE 'select * from ' || table_name
|| ' union all select * from ' || table_name;
Remark that I tried to pass a single SQL expression, with no Scripting block. However, it is
not working, as EXECUTE IMMEDIATE may be part of Snowflake Scripting and is not
allowed in pure SQL expression.
When I use the code below, with session variables, it’s fine. But this is in a SQL worksheets,
where both SQL statements and Scripting operations are allowed:

SET table_name = 'products';


SET stmt = 'select * from ' || $table_name
|| ' union all select * from ' || $table_name;
EXECUTE IMMEDIATE $stmt;

UDTF in Snowflake Scripting? Not supported.


OK, let’s try then with a typical better implementation in Snowflake Scripting, with a code
block:

create or replace function dup_any_udtf_script(table_name varchar)


returns table()
language sql
as
begin
LET c1 CURSOR FOR
select * from identifier(?)
union all
select * from identifier(?);
OPEN c1 USING (:table_name, :table_name);
RETURN TABLE(RESULTSET_FROM_CURSOR(c1));
end;

I know, I know, I checked the doc as well, and it clearly says (at the present moment at least),
that you can write stored procs with Snowflake Scripting, but only SQL (with simple
expressions) is allowed for UDFs and UDTFs.
But why are they not capable to detect that I pass a code block here (starting with BEGIN or
DECLARE) and simply tell me something like “Snowflake Scripting cannot be used for
UDFs!”. The messages I got were:
Syntax error:L compilation error: (line 54)
syntax error line 2 at position 4 unexpected ‘LET’.
syntax error line 3 at position 17 unexpected ‘from’.
syntax error line 3 at position 33 unexpected ‘?’.
syntax error line 5 at position 17 unexpected ‘from’.
syntax error line 5 at position 33 unexpected ‘?’. (line 54)
Stored Proc in Scripting? LANGUAGE not optional!
Let’s write it as a stored proc then, skipping the LANGUAGE header. They clearly say: “Note
that this is optional for stored procedures written with Snowflake Scripting”!
create or replace procedure dup_any_sp(table_name varchar)
returns table()
as
begin
LET c1 CURSOR FOR
select * from identifier(?)
union all
select * from identifier(?);
OPEN c1 USING (:table_name, :table_name);
RETURN TABLE(RESULTSET_FROM_CURSOR(c1));
end;

Now I get a very cryptic:


000603 (XX000): SQL execution internal error: Processing aborted due to error
300002:1162926047; incident 4107280.
And after more research I find on the community forum that “When returning a table in
a Snowflake Scripting stored procedure, add: ‘language SQL’ before the AS
clause and the procedure body.”.
OK then, we’ll add the missing LANGUAGE SQL and now it works!

create or replace procedure dup_any_sp(table_name varchar)


returns table()
language sql
as
begin
LET c1 CURSOR FOR
select * from identifier(?)
union all
select * from identifier(?);
OPEN c1 USING (:table_name, :table_name);
RETURN TABLE(RESULTSET_FROM_CURSOR(c1));
end;

call dup_any_sp('customers');
-- select * from table(result_scan(last_query_id()));

call dup_any_sp('products');

A CALL in a code block can use the INTO clause to pass the returned result (yes, stored procs
return results, just like functions!) into a local variable. While the result of an outside CALL,
from a SQL worksheet, is only dumped on screen in Snowsight.
Alternative with DECLARE
We could initialize the cursor in a DECLARE block as well:
create or replace procedure dup_any_sp10(table_name varchar)
returns table()
language sql
as
declare
c1 CURSOR FOR
select * from identifier(?)
union all
select * from identifier(?);
begin
OPEN c1 USING (:table_name, :table_name);
RETURN TABLE(RESULTSET_FROM_CURSOR(c1));
end;

call dup_any_sp10('customers');

The Snowflake doc says basically everywhere that “RESULTSET_FROM_CURSOR


returns a RESULTSET”. But this is at least confusing…
Try to assign RESULTSET_FROM_CURSOR(c1) to a RESULTSET variable, you’ll not be able
to do it. What I suspect is RESULTSET_FROM_CURSOR is rather a UDTF, which always
requires a TABLE! It does not return a RESULTSET object, but the query result (as a
collection of rows) a RESULTSET may point to.
Alternative with RESULTSET
The procedure is still obviously too verbose. And do I really need to pass through a cursor, and
then through a result set? A cursor offers bind variables: ? (question marks) in the SQL
statement are safely replaced at runtime with USING variables.
The result set does not offer bind variables, but you can make in-place replacements of the
variable data in the SQL statement instead. And we could also move the result set into a
DECLARE block.
This looks better because there is no explicit cursor anymore. And the query is automatically
executed (by an implicit hidden cursor!) in the DECLARE block, on RESULTSET’s
declaration!

create or replace procedure dup_any_sp2(table_name varchar)
returns table()
language sql
as
declare
r1 RESULTSET DEFAULT (
select * from identifier(:table_name)
union all
select * from identifier(:table_name));
begin
RETURN TABLE(r1);
end;

call dup_any_sp2('customers');

You can skip the RESULTSET declaration and get the equivalent short version (thanks, Tom
Meacham!):

create or replace procedure dup_any_sp22(table_name varchar)
returns table()
language sql
as
begin
LET r1 RESULTSET := (
select * from identifier(:table_name)
union all
select * from identifier(:table_name));
RETURN TABLE(r1);
end;

call dup_any_sp22('customers');

It’s interesting that you cannot pass the SQL statement directly to RETURN TABLE; you always
have to pass it through an intermediate RESULTSET.
But any inline SQL statement, with no assignment at all, will also get executed right away,
as it happens for the SQL-based procedures or functions. However, for Scripting no result
will be returned at all.
Alternative with EXECUTE IMMEDIATE
One other way to do it — not necessarily better — is to build the whole SQL statement as a
string.
Notice how we don’t need the : notation here. It is only required for local variables or
procedure parameters when you insert them into inline SQL statements. But to me it is rather
confusing that we use these variables sometimes with and sometimes without the prefix,
especially because we do not always get clear error or warning messages when we make
common mistakes.

create or replace procedure dup_any_sp3(table_name varchar)
returns table()
language sql
as
begin
LET stmt := 'select * from ' || table_name
|| ' union all select * from ' || table_name;
LET r1 RESULTSET := (EXECUTE IMMEDIATE :stmt);
RETURN TABLE(r1);
end;

call dup_any_sp3('customers');

Try to remove the RESULTSET data type from the LET r1 command, assuming the type will
be inferred. It is not! With a much simpler use case you get a proper warning (“Variable ‘R1’
cannot have its type inferred from initializer”). But in this slightly more complicated use case
we get a very confusing and misplaced “unexpected ‘immediate’” error.
The following EXECUTE IMMEDIATE alternative separates things out and first declares the local
variables in a DECLARE block:

create or replace procedure dup_any_sp31(table_name varchar)
returns table()
language sql
as
declare
stmt VARCHAR DEFAULT '';
r1 RESULTSET;
begin
stmt := 'select * from ' || table_name
|| ' union all select * from ' || table_name;
r1 := (EXECUTE IMMEDIATE :stmt);
RETURN TABLE(r1);
end;

call dup_any_sp31('customers');

It works, but what I found confusing — and I wasted some time on it — is yet again the lack of
clarity in the returned error messages, when you make a very honest mistake.
Try to leave the LET keywords as they were, and you get all sorts of error messages, either on
CREATE, or on CALL. But nothing will tell you that you should not use LET with variables
that you already declared before. Or, if you can do it, get at least a warning about it!
I repeated the experiment and I found indeed a simpler use case where they tell me this
(something like “Variable with name X declared twice.”). But this should get better
coverage…
Conclusions
 Snowflake should make it very clear that UDFs and UDTFs are NOT supported
with Snowflake Scripting at this time. A CREATE FUNCTION should return a
much better error message when a statement starting with BEGIN or DECLARE
is detected. From the inconsistencies I’ve seen, I can only assume that support
for UDFs with Scripting is coming. But in the meantime, we should not spend
hours figuring out what the errors are.
 Either make LANGUAGE SQL required everywhere, or make it optional but then
reliably detect whether the body is a plain SQL statement or a block of
Scripting code. If in doubt, send a clear error message such as “Add a LANGUAGE
header, as we are not sure what the language is”.
 There is frequent confusion about where the : prefix notation is truly required
for local script variables and procedure arguments. And yes, we know that it’s
required in inline SQL statements (not in literal strings!). But universal usage
of the : prefix for any variable in a script block would make everything more
consistent and much easier to follow.
 Many error messages for Snowflake Scripting are rather confusing and frequently
misplaced. It’s very hard to debug and figure out what was wrong even in a very
small block, with just 10–15 lines of code. If Snowflake Scripting is indeed a very
serious extension, as it’s supposed to be, the user experience should be improved as well.
 There are many bugs, or confusing things that are not supposed to happen, as I
illustrated with just a few use cases before.
 The doc for Snowflake Scripting is flooded everywhere with the alternative using
required $$ delimiters in SnowSQL and classic UI. Yes, it is a temporary
limitation, but you should find a better way to signal it, instead of repeating it
again and again for almost every example.

Testing the Snowflake Query Acceleration Service with a 17 TB table


Snowflake has an easy way to make queries faster while using the
smallest warehouses: The Query Acceleration Service has been ready for
production (GA) since February 2023. Let’s test it right now by scanning
17 terabytes of GitHub events.

Felipe Hoffa · Published in Snowflake · Jan 10
Amping up Snowflake’s compute
Traditionally Snowflake has offered 2 easy ways of increasing compute power when dealing
with larger queries and concurrency:
 Scale your session’s “virtual warehouse” to a larger size.
 Set up “multi-cluster warehouses” that dynamically add more clusters to deal
with peaks of concurrent usage.
With these 2 basic elements, users are able to set up policies to control costs and divide
resources between different teams and workloads.
For example — for my Snowflake experiments I usually do everything on my own “Small
Warehouse”. This keeps costs low, and it’s usually pretty fast and predictable. I only need to
scale to larger warehouses when dealing with huge transformations and extracts, like the
example we are going to play with today.
Scaling a warehouse up to get faster results on slower queries is super easy, barely an
inconvenience. I can jump at any moment from a “Small WH” to an “Extra Large WH”, to a
“Medium WH”, to a “4X-Large WH”, etc. This is cool, but then the question becomes: “How
can I tell exactly what’s the best WH size for my upcoming queries?”
Instead of resizing warehouses, it would be really cool if I could run my whole session on a
“Small WH” (or an “Extra Small WH”) and then I could have Snowflake automatically
intercept my larger queries, and run them with way more resources in a “magic serverless”
way.
And that’s exactly what the new Query Acceleration Service does. Let’s test it out here (with a
Snowflake Enterprise Edition account).
Conversation and live demo featuring Query Acceleration Service
Extracting data from GitHub’s 17 Terabyte Archive

For this experiment we are going to look into GH Archive — a collection of all GitHub events.
I did a lot of experiments with it in my past life at Google, and now Cybersyn has made a copy
of the GH Archive on the Snowflake Marketplace.
To bring this dataset into your Snowflake account, just ask your Account Admin to import it
at no cost:
Importing the GH Archive into your Snowflake account
For more tips, check my post “Querying GitHub Archive with Snowflake: The Essentials”. In
the meantime let’s continue with a straightforward example.
Note in the above screenshot that I renamed the incoming database to GHARCHIVE for
cleaner querying.
Once we have GHARCHIVE in our account, we can see 3 tables — with the main one
being events:
select count(*)
from gharchive.cybersyn.github_events;
-- 1.4s
-- 6,966,010,260

That’s 7 billion rows of rich history — and a lot of data to deal with. The first step when
exploring datasets this large should be to extract a subset of rows with the data we are
interested in:
For example, this is the whole history of Apache Iceberg on GitHub:

create or replace table gharchive_iceberg
as
select *
from gharchive.cybersyn.github_events
where repo_id = 158256479
order by created_at
-- 19m 52s small
-- 96s xxlarge
;

This table extraction took only 96 seconds on a “2X Large WH”, but a long ~22 minutes
on my usual “Small WH”.
Can the Query Acceleration Service (QAS) help here? There’s a very easy way to tell:

select system$estimate_query_acceleration('01b191db-0603-f84f-002f-a0030023f256');

And the response is “yes” (when using the query id from the ~22m run):

{
"queryUUID":"01b191db-0603-f84f-002f-a0030023f256",
"status":"eligible",
"originalQueryTime":1191.759,
"estimatedQueryTimes":{"1":608,"2":411,"4":254,"8":149,"31":55},
"upperLimitScaleFactor":31
}

Snowflake is telling us that the query took 1191s, and if we had let the QAS service help, it
could have taken between 608s and 55s — depending on the max scaling factor we would
allow it (in this case, up to 31).
To test QAS, I created a new WH. To make this test more dramatic, I made it an “Extra Small”
with unlimited scaling power:

use role sysadmin;
create warehouse xs_acc
warehouse_size = xsmall
enable_query_acceleration = true
query_acceleration_max_scale_factor = 0
;
grant usage on warehouse xs_acc to public
;

If I use the “Extra Small WH with unlimited QAS”, Snowflake now automatically accelerates
this query:

use warehouse xs_acc;
create or replace table gharchive_iceberg
as
select *
from gharchive.cybersyn.github_events
where repo_id = 158256479
order by created_at
-- 19m 52s small
-- 96s xxlarge
-- 77s xs_acc
;

To check the cost of this QAS query that ran in 77s while I was working within an “Extra Small
WH” session, we can check the logs:

use role accountadmin;
select * from
table(information_schema.query_acceleration_history())
;

-- CREDITS_USED  WAREHOUSE_NAME  NUM_FILES_SCANNED  NUM_BYTES_SCANNED
-- 0.499199157   XS_ACC          296,389            5,025,053,295,104

We can see that the query scanned 5 terabytes of data, for a total cost of roughly 0.5 credits.
Depending on the region, with an Enterprise Edition Snowflake account that should be around
$1.50.
In comparison:
 Small WH, Enterprise edition, 1192s: $3*2*1192/3600 = $1.99 (+ time
between queries and auto-suspend)
 2XL WH, Enterprise edition, 96s: $3*32*96/3600 = $2.56 (+ time between
queries and auto-suspend)
 QAS 21x auto-acceleration, within a X-Small session, 77s: $3*0.5
= $1.5 (serverless model, no auto-suspend needed for the QAS queries — but the
XS session kept running for $3*1*67/3600=$0.06 extra)
This is the power of the Query Acceleration Service: When it works, we don’t need to worry
anymore about re-sizing warehouses, and we can let Snowflake take care of the huge queries
that need extra power.

Query Acceleration Caveats
QAS is Generally Available (GA) in Snowflake and ready for you to use.
However, you will notice that it’s picky about which queries it decides to accelerate, and I expect
this set of supported queries to grow over time.
You can find a handy history of the queries in your account that could have been accelerated:

SELECT query_id, eligible_query_acceleration_time
FROM snowflake.account_usage.query_acceleration_eligible
ORDER BY eligible_query_acceleration_time DESC;

The docs also list what kind of queries are not eligible for acceleration (for now):
Some queries are ineligible for query acceleration. The following are
common reasons why a query cannot be accelerated:

- The query does not filter or aggregate.
- The filters are not selective enough. Alternatively, the GROUP BY
expression has a high cardinality.
- There are not enough partitions. If there are not enough partitions
to scan, the benefits of query acceleration are offset by the latency
in acquiring resources for the query acceleration service.
- The query includes a LIMIT clause but does not have an ORDER BY
clause.
- The query includes functions that return nondeterministic results
(for example, SEQ or RANDOM).

For example this query could not get QAS during my tests:

select min(created_at), max(created_at), current_timestamp()
from gharchive.cybersyn.github_events
where repo_id = 158256479
limit 10

But this one, which produces the same results, does get accelerated, thanks to the added group
by and order by:

select repo_id, min(created_at), max(created_at), current_timestamp()
from gharchive.cybersyn.github_events
where repo_id = 158256479
group by repo_id
order by repo_id
limit 10
-- 23s xs_acc

Next steps
 To go deeper analyzing GitHub, check my post “Querying GitHub Archive with
Snowflake: The Essentials”.
 Check my conversation and live demo featuring Query Acceleration Service with
Product Manager Tim Sander.
 Check out how many of your queries could have been accelerated with QAS in
your account.
 Report results, and give us feedback on what else you’d like QAS to take care of.
Try the combination of QAS + Search Optimization.

Want more?
The One Billion Row Challenge with Snowflake

Sean Falconer · Published in Snowflake · Jan 9

I recently learned about the One Billion Row Challenge initiated by Gunnar Morling. The
challenge is to create a Java program to process a CSV file containing 1 billion rows, where
each row contains the name of a weather station and a recorded temperature. The program
needs to calculate the minimum, average, and maximum temperatures for each weather
station.
Originally tailored for Java, the challenge has attracted interest in other programming
languages, as well as approaches using databases and SQL. Initially, I considered using this
challenge as an opportunity to delve into Rust or Go, two languages I’ve been eager to try.
However, I was inspired by Robin Moffat and Francesco Tisiot, who successfully tackled the
challenge using SQL with DuckDB, PostgreSQL, and ClickHouse. Motivated by their
approach, I chose to undertake the challenge using Snowflake.
This article details my various experiments and performance results.
Things to Note on Measuring Performance
In the actual challenge, all programs are run on a Hetzner Cloud CCX33 instance with 8
dedicated vCPU, 32 GB RAM. The time program is used to measure the execution times.
Robin and Francesco ran their databases locally. Clearly that’s not possible
with Snowflake, as it is only available in the cloud.
Additionally, the compute power for Snowflake is determined by the virtual warehouse size.
You can think of a virtual warehouse a bit like a remote machine that does compute for you. A
larger warehouse is going to give you more query processing power to reduce the overall time,
so this is not going to be an exact apples to apples comparison, but hey, this is just for fun :-).
For the purposes of my testing, I used the Large warehouse option. In my testing, a Large
warehouse was about 5 times faster during a table copy than an X-Small warehouse.
Loading and Processing CSV Files in Snowflake
The first step is to load the CSV file into Snowflake. There are a number of ways to do this, from
using Snowsight directly, to SnowSQL, to hosting data externally to Snowflake within an S3
bucket. For this particular situation, Snowsight is not a viable option, as the file we need to
process is 13 GB and Snowsight’s file limit is 50 MB.
For the challenge, I tested out two primary approaches: an internal stage and external stage,
along with a few variations for each approach.
A Snowflake stage is used for loading files into Snowflake tables or unloading data from tables
into files. An internal stage stores data within Snowflake, while an external stage references
data in a location outside of Snowflake, such as an S3 bucket.
In the first approach with an internal stage, I used Snowsql to move the file from my local
machine into a Snowflake stage, then copy the file into a table, and execute the SQL against
the table.
The second approach used an external stage that points to an AWS S3 bucket and an external
table. I kept all data external to Snowflake and avoided loading a native table.
There are pros and cons to both approaches. Let’s start by looking at the internal stage
approach.
The Internal Stage Approach
After cloning the 1brc repo and generating the dataset, the data needs to be uploaded into an
internal stage. With the data uploaded, we’ll copy the data into a table where we can run a
query to produce the required output. The flow for this process is similar to what’s shown in
the diagram below.

Example flow for moving measurement files into a table and the query operation needed to produce
the challenge output.
To move the measurement data into an internal stage, I used the following commands.
CREATE OR REPLACE FILE FORMAT measurement_format
TYPE = 'CSV'
FIELD_DELIMITER = ';';

CREATE OR REPLACE STAGE measurements_record_stage
FILE_FORMAT = measurement_format;

PUT file:///LOCATION_TO_FILE/1brc/measurements*
@measurements_record_stage
AUTO_COMPRESS=TRUE PARALLEL=20;

I’m compressing the CSV file (AUTO_COMPRESS gzips it) and the PARALLEL parameter helps speed up the
upload by using multiple threads. The output of the PUT command looks like this:

Example of internal stage content after uploading measurements.txt.


Note that I didn’t consider staging the file to be part of performance measurement. It’s
essentially equivalent to moving the file onto a Hetzner Cloud CCX33 instance as done in the
main challenge.
Creating a Table from the Internal Stage
The next step is to copy the internal stage data into a table.
First I created a table called measurements.

CREATE OR REPLACE TABLE measurements (
location VARCHAR(50),
temperature NUMBER
);

Next, I used the COPY INTO command to move the stage data into the measurements table.

COPY INTO measurements
FROM @measurements_record_stage
FILE_FORMAT = (FORMAT_NAME = measurement_format)
ON_ERROR = 'skip_file';

Doing this with the 13 GB CSV file is quite slow as you can see in the performance table below.
By keeping all the data in a single file, we miss out on the parallel processing power of a cloud-
based platform like Snowflake. The documentation recommends keeping files under 250 MB
compressed to optimize for parallel loading.
To address this, I split the measurements file into a collection of smaller files. I experimented
with splitting the file into different size chunks to see how that impacted performance. The
best result was with 10 million records per file and a total of 100 files.
Time to copy files into a table depending on how the single file was split up.
Moving the data from the stage into a table is where the bulk of the operational cost is going
to be. Breaking the file up into chunks to parallel process the records during the table copy
process saves a lot of time, but there’s still a significant cost.
In contrast, when handling this challenge in a programming language like Java, you can
process the file and simultaneously compute the required values, storing them in a data
structure tailored for the intended output. Creating a specialized, tailored solution is likely
going to be less costly than loading everything into a general-purpose database.
However, in the world outside of this competition, the advantage of a database is that you can
run a myriad of queries against the dataset with very high performance. The flexibility and
efficiency in querying typically outweighs the initial data loading costs.
Querying the Data in Snowflake
The last step is to query the data and produce the results in the format as specified by the One
Billion Row Challenge.
Creating a list of results grouped by location along with the minimum, average, and maximum
values is pretty straightforward.

SELECT location,
MIN(temperature) AS min_temperature,
CAST(AVG(temperature) AS NUMBER(8, 1)) AS mean_temperature,
MAX(temperature) AS max_temperature
FROM measurements
GROUP BY location;

However, the challenge specifies that the data must be printed in alphabetical order in a
single line similar to what’s shown below:
{Abha=5.0/18.0/27.4, Abidjan=15.7/26.0/34.1, Abéché=12.1/29.4/35.6,
Accra=14.7/26.4/33.1, Addis Ababa=2.1/16.0/24.3,
Adelaide=4.1/17.3/29.7, …}

To satisfy this requirement, I created an outer query that concatenates the bracket, slash, and
comma characters as needed and flattens the results using the LISTAGG function.

SELECT '{' || LISTAGG(location || '='
|| CONCAT_WS('/', min_temperature, mean_temperature,
max_temperature), ', ')
WITHIN GROUP (ORDER BY location) || '}'
FROM (
SELECT location,
MIN(temperature) AS min_temperature,
CAST(AVG(temperature) AS NUMBER(8, 1)) AS mean_temperature,
MAX(temperature) AS max_temperature
FROM measurements
GROUP BY location);

With this in place, the average query time with caching disabled is 1.77 seconds.
This gives the following overall timings:
 23.03 seconds total
- 21.26 seconds table copy time
- 1.77 seconds query time
If query caching is on, then once the first query is run, subsequent queries are about 10 times
faster, with an average time of 0.182 seconds.
An overall time of 23.03 seconds is decent, but not spectacular. The best Java programs at the
time of this writing are close to 6 seconds.
Avoiding the Table Copy Operation
Since the bulk of the operational cost of the first approach is the table copy, I attempted to see
how I could avoid that.
Snowflake supports querying directly against an internal stage. The actual query syntax is
very similar to querying against a table as shown below.

SELECT '{' || LISTAGG(location || '='
|| CONCAT_WS('/', min_temperature, mean_temperature,
max_temperature), ', ')
WITHIN GROUP (ORDER BY location) || '}'
FROM (
SELECT t.$1 as location,
MIN(t.$2) AS min_temperature,
CAST(AVG(t.$2) AS NUMBER(8, 1)) AS mean_temperature,
MAX(t.$2) AS max_temperature
FROM @measurements_record_stage t
GROUP BY t.$1);

Snowflake’s documentation explicitly states that querying a stage is intended to only be used
for simple queries and not a replacement for moving data into a table.
I ignored this warning just to see what the performance would be like. Querying the billion
records contained within a single file was super slow. So slow that I killed the query
operation.
However, similar to the optimization I made for the table copy where I split the records into
multiple files, when I use the 100 files with 10 million records each, I am able to reduce the
total time to calculate the minimum, average, and maximum temperatures and print the
result to 12.33 seconds. That’s about 50% faster than the first method!
The External Stage Approach
I also wanted to see how an external stage would perform. My goal with the external stage is
to keep the data from being copied into Snowflake. Under normal circumstances there could
be financial reasons for doing this, but for this test, I was interested in comparing the
performance with the internal stage approach previously discussed.
To create the external stage, I first uploaded the temperature measurement files to an S3
bucket. Similar to the internal stage approach, I broke up the single file into 100 separate
files.

AWS S3 bucket after uploading 100 measurement files.


There are a few AWS IAM permissioning steps required to allow Snowflake to read or write to
an S3 bucket you own. But once those configurations were in place, I created a storage
integration as shown below.

CREATE STORAGE INTEGRATION s3_one_brc
TYPE = EXTERNAL_STAGE
STORAGE_PROVIDER = 'S3'
ENABLED = TRUE
STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::<AWS_ROLE_ARN>'
STORAGE_ALLOWED_LOCATIONS = ('s3://one-brc-sean/', 's3://one-brc-sean/');

Then created the external stage referencing the storage integration.

CREATE STAGE s3_brc_stage
STORAGE_INTEGRATION = s3_one_brc
URL = 's3://one-brc-sean/'
FILE_FORMAT = measurement_format;

To query the external stage, I created an external table as shown below.

CREATE OR REPLACE EXTERNAL TABLE external_measurements(
location VARCHAR AS (value:c1::varchar),
temperature NUMBER AS (value:c2::number)
)
WITH LOCATION = @s3_brc_stage
FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = ';' SKIP_HEADER = 1);

Finally, I can query the data similar to an internal table.

SELECT '{' || LISTAGG(location || '='
|| CONCAT_WS('/', min_temperature, mean_temperature,
max_temperature), ', ')
WITHIN GROUP (ORDER BY location) || '}'
FROM (
SELECT location,
MIN(temperature) AS min_temperature,
CAST(AVG(temperature) AS NUMBER(8,1)) AS mean_temperature,
MAX(temperature) AS max_temperature
FROM external_measurements
GROUP BY location);

Using this approach, I got an average end to end time of 24.83 seconds, so similar to the
total time of the first approach with an internal stage and table copy operation. With caching
turned on, subsequent queries average 0.232 seconds.
I also tried seeing if I could improve performance with a materialized view. This turned out to
be much worse. The query time against the materialized view was about the same as the
external table, but there’s an additional cost to create the view, which took on average about
40 seconds.
CREATE OR REPLACE MATERIALIZED VIEW measurements_view AS
SELECT location, temperature FROM external_measurements;

Cranking Up My Warehouse Size


Throughout the prior experiments, I used a Large warehouse. Just for fun and to see how
things would compare, I spun up a 4X-Large warehouse instance, and re-ran the internal
stage steps I outlined previously.
With a 4X-Large warehouse, the table copy time was cut from 21.26 seconds to 14.66 seconds,
yielding a total time to copy and execute the query of 16.50 seconds.
I also tried querying the internal stage directly as before to avoid the table copy. This cut the
total time from 12.33 seconds to 6.67 seconds.
Wrapping Up
This was a fun little challenge where I got to dive into some features of Snowflake I don’t use
on a regular basis.
The best performance I achieved with the Large warehouse was 12.33 seconds where I split
the file up into 100 smaller files, used an internal stage, and queried the stage directly. I’m
clearly cheating a bit by splitting the file up and not counting that pre-processing step towards
the total time.
Snowflake, or any kind of database, isn’t really the ideal approach to this challenge, as the
challenge is all about the speed of returning the required string. There’s no value, in the context
of the challenge, in speeding up multiple fetches or having the flexibility to do other types of analysis.
But it’s fun to try :-).
If you have any suggestions or ideas about how to optimize the approach, please let me know.
## Question 1.
Difference between Truncate, Drop, Delete.
The DROP statement can be used to remove database objects like tables, views, functions,
procedures, triggers, etc.
DELETE is a DML statement, hence we need to commit the transaction in order to save the
changes to the database, whereas TRUNCATE and DROP are DDL statements, so no commit is
required.
For example: Below statement will delete only the records from employee table where the
name is ‘Tanya’
DELETE FROM employee WHERE name = 'Tanya';
COMMIT;
Below statement will delete all records from the employee table.
DELETE FROM employee;
COMMIT;
Below statement will also delete all the records from the employee table. No commit is
required here.
TRUNCATE TABLE employee;
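For completeness, a DROP example to contrast with the statements above (it removes the table definition together with its data, and no commit is required):

DROP TABLE employee;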
## Question 2.
Difference between RANK, DENSE_RANK and ROW_NUMBER window
function.
RANK() function will assign a rank to each row within each partitioned result set. If multiple
rows have the same value then each of these rows will share the same rank. However the rank
of the following (next) rows will get skipped. Meaning for each duplicate row, one rank value
gets skipped.
DENSE_RANK() function will assign a rank to each row within each partitioned result set.
If multiple rows have the same value then each of these rows will share the same rank.
However the dense_rank of the following (next) rows will NOT get skipped. This is the only
difference between rank and dense_rank. RANK() function skips a rank if there are duplicate
rows whereas DENSE_RANK() function will never skip a rank.
ROW_NUMBER() function will assign a unique row number to every row within each
partitioned result set. It does not matter if the rows are duplicate or not.
By using the managers table, let’s write a query to get the rank, dense rank and row number
for each manager based on their salary.
SELECT *
, RANK() OVER(ORDER BY salary DESC) AS ranks
, DENSE_RANK() OVER(ORDER BY salary DESC) AS dense_ranks
, ROW_NUMBER() OVER(ORDER BY salary DESC) AS row_numbers
FROM managers;
Table Name: MANAGERS
Contains salary details of 5 different managers.

Manager Table

Result from above query
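The result image is not reproduced here, but as a hedged, self-contained illustration with hypothetical salary values, the three functions differ as follows:

SELECT salary,
       RANK()       OVER (ORDER BY salary DESC) AS ranks,
       DENSE_RANK() OVER (ORDER BY salary DESC) AS dense_ranks,
       ROW_NUMBER() OVER (ORDER BY salary DESC) AS row_numbers
FROM (VALUES (95000), (90000), (90000), (80000), (80000)) AS m(salary);
-- 95000 -> 1, 1, 1
-- 90000 -> 2, 2, 2
-- 90000 -> 2, 2, 3  (row_number keeps increasing for duplicates)
-- 80000 -> 4, 3, 4  (rank skips 3, dense_rank does not)
-- 80000 -> 4, 3, 5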


## Question 3.
Difference between Unique, primary keys, foreign keys.
Primary key, unique key and foreign key are constraints we can create on a table.
When you make a column in the table as primary key then this column will always have
unique or distinct values. Duplicate values and NULL value will not be allowed in a
primary key column. A table can only have one primary key. Primary key can be created either
on one single column or a group of columns.
When you make a column in the table as unique key then this column will always have
unique or distinct values. Duplicate values will not be allowed. However, NULL values are
allowed in a column which has unique key constraint. This is the major difference between
primary and unique key.
Foreign key is used to create a parent-child (master-detail) kind of relationship between two
tables. When we make a column in a table a foreign key, this column has to reference a column
(typically the primary key) in another table.
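A minimal generic-SQL sketch of all three constraints (table and column names are hypothetical; note that Snowflake accepts these constraints but does not enforce them, except NOT NULL):

CREATE TABLE departments (
    dept_id   INT PRIMARY KEY,       -- unique, no NULLs, only one per table
    dept_code VARCHAR(10) UNIQUE,    -- unique, but NULLs are allowed
    dept_name VARCHAR(100)
);

CREATE TABLE employees_demo (
    emp_id  INT PRIMARY KEY,
    dept_id INT,
    CONSTRAINT fk_dept FOREIGN KEY (dept_id) REFERENCES departments (dept_id)
);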
## Question 4.
Difference between “WHERE ” and “HAVING”clause.
WHERE clause is used to filter records from the table. We can also specify join conditions
between two tables in the WHERE clause. If a SQL query has both WHERE and GROUP BY
clause then the records will first get filtered based on the conditions mentioned in WHERE
clause before the data gets grouped as per the GROUP BY clause.
Conditions specified in the WHERE clause are applied to individual rows in the table.
Whereas HAVING clause is used to filter records returned from the GROUP BY clause. So if
a SQL query has WHERE, GROUP BY and HAVING clause then first the data gets filtered
based on WHERE condition, only after this grouping of data takes place. Finally based on the
conditions in HAVING clause the grouped data again gets filtered.
Conditions specified in the HAVING clause are applied to aggregated values, not individual
rows.
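A short illustration using the same employees table that appears in the next questions (the 40,000 threshold is arbitrary):

SELECT department,
       AVG(salary) AS avg_salary
FROM employees
WHERE salary IS NOT NULL          -- row-level filter, applied before grouping
GROUP BY department
HAVING AVG(salary) > 40000;       -- group-level filter, applied to the aggregated groups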
## Question 5.
Difference between PARTITION BY and GROUP BY.
 The GROUP BY clause is used in combination with aggregate functions to
group rows based on one or more columns.
 It is typically used in queries where you want to perform aggregate calculations
(such as SUM, COUNT, AVG, etc.) on groups of rows that share common values in
specified columns.
 The GROUP BY clause is applied before the SELECT clause in the query
execution.

-- Using GROUP BY
SELECT department, AVG(salary) AS avg_department_salary
FROM employees
GROUP BY department;

Output:

| department | avg_department_salary |
|------------|-----------------------|
| HR | 52500.00 |
| IT | 65000.00 |

 The PARTITION BY clause is used with window functions, which are a set of
functions that perform calculations across a specific range of rows related to the
current row within the result set.
 PARTITION BY divides the result set into partitions to which the window
function is applied separately. It doesn't group the rows in the same way
as GROUP BY.
 Window functions, and therefore their PARTITION BY clauses, are evaluated after
WHERE, GROUP BY, and HAVING, as part of the SELECT phase of query execution.

-- Using PARTITION BY
SELECT employee_id, department, salary,
AVG(salary) OVER (PARTITION BY department) AS
avg_department_salary
FROM employees;

Output:

| employee_id | department | salary   | avg_department_salary |
|-------------|------------|----------|-----------------------|
| 1           | HR         | 50000.00 | 52500.00              |
| 2           | HR         | 55000.00 | 52500.00              |
| 3           | IT         | 60000.00 | 65000.00              |
| 4           | IT         | 65000.00 | 65000.00              |
| 5           | IT         | 70000.00 | 65000.00              |

## Question 6.
Imagine there is a FULL_NAME column in a table which has values like “Elon
Musk“, “Bill Gates“, “Jeff Bezos“ etc. So each full name has a first name, a space
and a last name. Which functions would you use to fetch only the first name
from this FULL_NAME column? Give example.

SELECT
SUBSTR(full_name, 1, POSITION(' ' IN full_name) - 1) as first_name
FROM
your_table_name;
 SUBSTR(full_name, 1, POSITION(' ' IN full_name) - 1): This part
of the query uses the SUBSTR function to extract a substring from
the full_name column. The arguments are as follows:
 full_name: The source string from which the substring is extracted.
 1: The starting position of the substring (in this case, from the beginning of
the full_name).
 POSITION(' ' IN full_name) - 1: The length of the substring. It
calculates the position of the space (' ') in the full_name column using
the POSITION function and subtracts 1 to exclude the space itself.
 as first_name: This part of the query assigns the extracted substring an alias
"first_name" for the result set.
## Question 7.
How can you convert a text into date format? Consider the given text as “31-01-2021”.
In SQL, the TO_DATE function is commonly used to convert a text representation of a date
into an actual date format. The syntax of the TO_DATE function varies slightly across database
systems; the example below assumes the 'DD-MM-YYYY' format.
Here’s an explanation of the SQL query:

SELECT TO_DATE('31-01-2021', 'DD-MM-YYYY') as date_value;

 TO_DATE('31-01-2021', 'DD-MM-YYYY'): This part of the query uses the TO_DATE
function to convert the text '31-01-2021' into a date format. The first argument
('31-01-2021') is the text representation of the date, and the second argument
('DD-MM-YYYY') is the format of the date in the input text.
 as date_value: This part of the query assigns an alias 'date_value' to the
result, which is the converted date.
## Question 8.
Why do we use the CASE statement in SQL? Give an example.
The CASE statement is similar to the IF-ELSE statement in other programming languages. We
can use it to fetch or show a particular value based on certain conditions.
CASE statement in SQL is used to perform conditional logic within a query.
Here’s a simple example of using the CASE statement in a SELECT query:

SELECT
employee_name,
salary,
CASE
WHEN salary > 50000 THEN 'High Salary'
WHEN salary > 30000 THEN 'Medium Salary'
ELSE 'Low Salary'
END AS salary_category
FROM
employees;

In this example, the CASE statement is used to categorize employees based on their salary. If
the salary is greater than 50,000, the category is 'High Salary.' If the salary is between 30,000
and 50,000, the category is 'Medium Salary.' Otherwise, the category is 'Low Salary.'
## Question 9.
What is the difference between LEFT, RIGHT, FULL outer join and INNER
join?

Joins
In order to understand this better, let’s consider two tables CONTINENTS and COUNTRIES
as shown below. I shall show sample queries considering these two tables.
Table Name: CONTINENTS
Has data of 6 continents. Please note the continent “Antarctica” is intentionally omitted from
this table.

Table Name: COUNTRIES


Has data of one country from each continent. Please note that I have intentionally not added
a country from Europe to this table.
INNER JOIN will fetch only those records which are present in both the joined tables. The
matching of the records is only based on the columns used for joining these two tables.
INNER JOIN can also be represented as JOIN in your SELECT query.
INNER JOIN Query
SELECT cr.country_name, ct.continent_name
FROM continents ct
INNER JOIN countries cr
ON ct.continent_code = cr.continent_code;
LEFT JOIN will fetch all records from the left table (table placed on the left side during the
join) even if those records are not present in right table (table placed on the right side during
the join). If your select clause has a column from the right table then for records which are not
present in right table (but present in left table), SQL will return a NULL value. LEFT JOIN
can also be represented as LEFT OUTER JOIN in your SELECT query.

LEFT JOIN Query


SELECT cr.country_name, ct.continent_name
FROM continents ct
LEFT JOIN countries cr
ON ct.continent_code = cr.continent_code;
RIGHT JOIN will fetch all records from the right table (table placed on the right side during
the join) even if those records are not present in left table (table placed on the left side during
the join). If your select clause has a column from the left table then for records which are not
present in left table (but present in right table), SQL will return a NULL value. RIGHT JOIN
can also be represented as RIGHT OUTER JOIN in your SELECT query.
*Note: LEFT and RIGHT join depends on whether the table is placed on the
left side of the JOIN or on the right side of the JOIN.

RIGHT JOIN Query


SELECT cr.country_name, ct.continent_name
FROM continents ct
RIGHT JOIN countries cr
ON ct.continent_code = cr.continent_code;
FULL JOIN will fetch all records from both left and right table. It’s kind of combination of
INNER, LEFT and RIGHT join. Meaning FULL JOIN will fetch all the matching records in
left and right table + all the records from left table (even if these records are not present in
right table) + all the records from right table (even if these records are not present in left
table). FULL JOIN can also be represented as FULL OUTER JOIN in your SELECT query.
FULL OUTER JOIN Query
SELECT cr.country_name, ct.continent_name
FROM continents ct
FULL OUTER JOIN countries cr
on ct.continent_code = cr.continent_code;
Also check: what are SELF join, NATURAL join and CROSS join?
SELF JOIN is when you join a table to itself. There is no SELF keyword for doing this
join; we just use a normal INNER join, except that instead of joining two different tables, we
join the same table to itself, giving the two references different alias names. Other than this,
a SELF join performs just like an INNER join.

SELF JOIN Query


SELECT cr1.country_name
FROM countries cr1
JOIN countries cr2 ON cr1.country_code = cr2.continent_code;
NATURAL JOIN is similar to INNER join but we do not need to use the ON clause during
the join. Meaning in a natural join we just specify the tables. We do not specify the columns
based on which this join should work. By default when we use NATURAL JOIN, SQL will join
the two tables based on the common column name in these two tables. So when doing the
natural join, both the tables need to have columns with same name and these columns should
have same data type.

NATURAL JOIN Query


SELECT cr.country_name, ct.continent_name
FROM continents ct
NATURAL JOIN countries cr;
CROSS JOIN will join all the records from left table with all the records from right table.
Meaning the cross join is not based on matching any column. Whether there is a match or
not, cross join will return records which is basically number of records in left table multiplied
by number of records in right table. In other words, cross join returns a Cartesian product.
CROSS JOIN Query
SELECT cr.country_name, ct.continent_name
FROM continents ct
CROSS JOIN countries cr;
## Question 10.
Can we use an aggregate function as a window function? If yes, then how do we do it?
Yes, we can use an aggregate function as a window function by using the OVER clause. Aggregate
functions reduce the number of rows or records, since they perform a calculation over a set of
row values to return a single value, whereas a window function does not reduce the number of
records.
Now, let’s use the SUM function as a window function to calculate the running total salary
within each department based on the order of salaries:

SELECT
employee_id,
employee_name,
department,
salary,
SUM(salary) OVER (PARTITION BY department ORDER BY salary) AS
running_total_salary
FROM
employees;

Output:

| employee_id | employee_name | department | salary   | running_total_salary |
|-------------|---------------|------------|----------|----------------------|
| 1           | John          | HR         | 50000.00 | 50000.00             |
| 3           | Bob           | HR         | 55000.00 | 105000.00            |
| 2           | Jane          | IT         | 60000.00 | 60000.00             |
| 4           | Alice         | IT         | 70000.00 | 130000.00            |

In this example, the running_total_salary column represents the running total salary
within each department, calculated based on the ascending order of salaries. The PARTITION
BY clause is used to partition the result set by the department column. The ORDER
BY clause specifies the order of rows within each partition based on the salary column.
The SUM function is applied as a window function, and it calculates the running total salary
for each row within its department.
Snowflake Performance Challenges & Solutions (Part 1)

Slim Baltagi · Nov 8, 2023

Disclaimer: The opinions in this two-part blog series are entirely mine and do not necessarily
reflect my employers’ (past, present, or future) opinions. This series is neither a critique nor a
praise of the Snowflake Data Cloud. I am attempting to cover a complex topic, Snowflake
Performance Challenges & Solutions, with the hope of paving the way to open, honest, and
constructive related discussions for the benefit of all.
For your convenience, here is the table of contents so you can quickly jump to the item or
items that interest you the most.
TABLE OF CONTENTS
I. Introduction
II. Snowflake Technology (9)
1. Misperceptions and confusion from some Snowflake marketing messages
2. Inherent Snowflake performance limitations
2.1. Performance was not originally a major focus of Snowflake
2.2. Out-of-the-box limited concurrency
2.3. Heavy reliance on caching
2.4. Data clustering limitations
2.5. Vertical scaling is a manual process
2.6. Initial poor performance of ad-hoc queries
2.7. No exposure of functionality like query optimizer hints
2.8. Rigid virtual warehouse sizing leads to computing waste and overspending
2.9. Missing out on performance optimizations from Open File Formats
2.10. No uniqueness enforcement
2.11. No full separation of storage and compute
2.12. Heavy scans
2.13. Out-of-the-box concurrent write Limits
2.14. Data Latency
2.15. Inefficient natural partitioning as the data grows
2.16. Shared Cloud Services layer performance degradation
2.17. No Snowflake query workload manager
2.18. Significant delay in getting large output Resultset in Snowflake UI
3. Forced spending caused by built-in Snowflake service offerings
4. Not enough performance tuning tools and services
5. Overall performance degradation due to data, users, and query complexity growth
6. Performance limitations due to cloud generic hardware infrastructure
7. Dependency on cloud providers and risk of services downtime
8. Limitations in the Query Profile
9. Lack of documentation of error codes and messages
III. Snowflake Ecosystem (3)
1. Inefficient integration with Snowflake
2. Best practices knowledge required by third-party tools and services
3. Piecemeal information about Snowflake performance optimization and tuning
IV. Snowflake Users (4)
1. Snowflake users need to get familiar with Snowflake’s ways of dealing with
performance
2. Snowflake users take shortcuts for solving performance issues
3. Snowflake users lack the knowledge about performance optimization and tuning
4. Snowflake users face a steep learning curve of Snowflake performance
optimization and tuning
4.1. Long-running queries
4.2. Complex queries
4.3. Inefficient queries
4.4. Storage Spillage
4.5. Concurrent queries
4.6. Outlier queries
4.7. Queued queries
4.8. Blocked queries
4.9. Queries against the metadata layer
4.10. Inefficient Pruning
4.11. Inefficient data model
4.12 Connected clients issues
4.13. Lack of sub-query pruning
4.14. Queries not leveraging results cache
4.15. Queries doing a point lookup
4.16. Queries against very large tables
4.17. Latency issues
4.18. Heavy Scanning
4.19. Slow data load speed
4.20. Slow ingestion of streaming data
I. Introduction
You might not be satisfied with the performance of your Snowflake account and need to
optimize it for one or all of the following reasons:
1. Data & Users Growth: As the underlying data grows or the number of users
querying the data increases, queries might start seeing performance issues.
2. Better end-user experience: Users of executive dashboards and data applications
powered by Snowflake would require a better experience.
3. Cost: Striking a balance between performance and cost might be one of your goals
unless you have an unlimited budget! Tuning long-running, frequently executed
queries helps reduce cost.
4. Service Level Agreement (SLA): Specific use cases might require meeting an SLA;
otherwise the business can be negatively impacted. For example, a specific query
should return results in less than 10 seconds.
5. Critical path: Queries rely on the results of other queries. For example, queries
transforming data or reading data would depend on queries loading data.
You might then need some help with understanding Snowflake’s performance challenges and
related solutions. That’s why I am writing this 2-part blog series!
In this first part, I focus on the performance challenges of Snowflake based on a
multidimensional approach that addresses Snowflake technology, Snowflake ecosystem, and
Snowflake users. As users, let’s not blame the technology if we are misusing or abusing
Snowflake! As vendors, let’s not blame the technology if our Snowflake integration is not
efficient or optimal!
In the second part, I will propose solutions for Snowflake performance tuning and
optimization with a unique step-by-step approach based on needs, symptoms, diagnoses, and
remediations.
II. Snowflake Technology (9)
1. Misperceptions and confusion from some Snowflake marketing messages
As a SaaS, Snowflake does not require the management of physical hardware, the installation
of software, and related maintenance. The Snowflake data platform is continually updated in
the background without the need for user involvement. Under the hood, Snowflake takes care
of many performance-related aspects that are usually the responsibilities of the customers in
other data warehouses and data platforms. Examples include horizontal or vertical data
partitioning to specify, data shards for even distribution across nodes, vacuuming, data
statistics collection and maintenance, and distribution Keys.
Many misperceptions and confusions about Snowflake performance tuning and optimization
are due to claims such as:
 ‘With the arrival of the cloud-built data warehouse, performance optimization
becomes a challenge of the past’. This is claimed by Snowflake Inc. in this white
paper titled ‘How Snowflake Automates Performance in a Modern Cloud Data
Warehouse’ and published on October 16, 2019.
 ‘Using Snowflake, everyone benefits from performance automation with very
little manual effort or maintenance’. This is claimed by Snowflake inc. in
this white paper titled ‘How Snowflake Automates Performance in a Modern
Cloud Data Warehouse’ and published on October 16, 2019.
 Insights into Snowflake’s Near-Zero Management, a recorded presentation
published on January 23, 2020. Here is an example of a Snowflake customer’s
reaction to such a ‘Zero Management or Near Zero Management’ statement,
as reported by a Snowflake employee in his blog: “I was recently working for a
major UK customer, where the system manager said ‘Snowflake says it needs
Zero Management, but surely that’s just a marketing ploy’.”
 Automatic Query Optimization. No Tuning!, a blog by Snowflake Inc. published
on May 19, 2016. Nevertheless, Snowflake offers a USD 3,000 Snowflake
Performance Automation and Tuning 3-Day Training!
 Snowflake Data Management — No Admin Required, a recorded presentation by
Snowflake Inc. published on January 13, 2020. Nevertheless, Snowflake Inc.
offers role-based USD 3,000 ‘Administering Snowflake Training’ and USD 375
‘SnowPro Advanced Administrator Certification!
 Pacific Life- Busting Bottlenecks for Data Scientists With 1,800x Faster Query
Performance, A case study from Snowflake Inc.
 Scale a near-infinite amount of computing resources, up or down, with just a few
clicks. Actually, you might run a query and get the error: “Maximum number of
servers for the account exceeded”. See the article from Snowflake: Query Failed
with Error: Max number of servers for the account exceeded
Such statements, when not backed by facts from Snowflake Inc., come across as marketing fluff
and do a disservice to the Snowflake Data Cloud technology: customers end up dissatisfied with
them and the competition exploits them.
2. Inherent Snowflake performance limitations
2.1. Performance was not originally a major focus of Snowflake: This is a quote from
Snowflake founders in their paper The Snowflake Elastic Data Warehouse by Snowflake
Computing: ‘… Snowflake has only one tuning parameter: how much performance the user
wants (and is willing to pay for). While Snowflake’s performance is already very competitive,
especially considering the no-tuning aspect, we know of many optimizations that we have not
had the time for yet. Somewhat unexpected though, core performance turned out to be almost
never an issue for our users. The reason is that elastic compute via virtual warehouses can
offer the performance boost occasionally needed. That made us focus our development efforts
on other aspects of the system.’
Although Snowflake kept adding new services to improve performance such as Materialized
Views, Auto Clustering, Query Acceleration Service, Search Optimization and a few
transparent enhancements such as data compression rate for new data loaded in Snowflake,
ability to eliminate joins on key columns, … Most either came at additional costs or did not
solve many of its inherent performance limitations as evidenced in the below list.
2.2. Out-of-the-box limited concurrency: 8 concurrent queries per warehouse by default, and
auto-scaling up to 10 clusters per multi-cluster warehouse. On a single-cluster virtual warehouse, you might hit a limit
of eight concurrent queries. 8 is the default value of the Snowflake
parameter MAX_CONCURRENCY_LEVEL that defines the maximum number of parallel or
concurrent statements a warehouse can execute. See also the article from Snowflake
Knowledge Base Warehouse Concurrency and Statement Timeout Parameters, published on
August 16, 2020. Such a low query concurrency limit forces either increasing the size of a
single-cluster virtual warehouse or starting additional clusters in the case of a multi-cluster
virtual warehouse (in Auto-scale mode). In both cases, this forces you to burn even more
credits compared to the default concurrency you get out of the box from Snowflake!
This is an answer from the Snowflake product team that I am posting as is: “We don’t have a
maximum of 8 concurrent queries: Check out this documentation for more
details: Parameters — Snowflake Documentation Note that this parameter does not limit the
number of statements that can be executed concurrently by a warehouse cluster. Instead, it
serves as an upper-boundary to protect against over-allocation of resources. As each
statement is submitted to a warehouse, Snowflake allocates resources for executing the
statement; if there aren’t enough resources available, the statement is queued or additional
clusters are started, depending on the warehouse.
The actual number of statements executed concurrently by a warehouse might be more or less
than the specified level:
 Smaller, more basic statements: More statements might execute concurrently
because small statements generally execute on a subset of the available compute
resources in a warehouse. This means they only count as a fraction towards the
concurrency level.
 Larger, more complex statements: Fewer statements might execute
concurrently.”
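As a hedged sketch of how the MAX_CONCURRENCY_LEVEL parameter mentioned above can be inspected and adjusted at the warehouse level (the warehouse name is hypothetical; raising the value trades per-query resources for concurrency, and multi-cluster settings require Enterprise Edition or higher):

-- Show the current concurrency setting for a warehouse
SHOW PARAMETERS LIKE 'MAX_CONCURRENCY_LEVEL' IN WAREHOUSE my_wh;

-- Allow more statements to run concurrently on each cluster
ALTER WAREHOUSE my_wh SET MAX_CONCURRENCY_LEVEL = 12;

-- Or scale out instead: let the warehouse add clusters under load
ALTER WAREHOUSE my_wh SET MIN_CLUSTER_COUNT = 1, MAX_CLUSTER_COUNT = 3;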
2.3. Heavy reliance on caching: Heavy reliance of Snowflake on caching can result in
unpredictable and non-optimal performance. At its core, Snowflake is built on a caching
architecture which works well on small scale data sets or repetitive traditional queries. This
architecture starts to fall down as data volumes expand or the workload complexity increases.
In the emerging space of advanced analytics, where machine learning, artificial intelligence,
graph theory, geospatial analytics, time-series analysis, adhoc analysis, and real-time
analytics are becoming predominant in every enterprise — data sets are typically larger and
workloads are becoming much more complex.
This is recognized by Snowflake Inc! “Since end-to-end query performance depends on both
cache hit rate for persistent data files and I/O throughput for intermediate data, it is
important to optimize how the ephemeral storage system splits capacity between the two.
Although we currently use the simple policy of always prioritizing intermediate data, it may
not be the optimal policy with respect to end-to-end performance objectives.” Excerpt from
this presentation and related paper titled ‘Building An Elastic Query Engine on Disaggregated
Storage’ and published in February 2020. PDF (15 pages), Video (19' 57")
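One practical consequence of this cache reliance: when benchmarking, you may want to rule out the result cache so that timings reflect real compute rather than cache hits. A minimal sketch (a session-level parameter only; it does not affect the warehouse data cache):

-- Disable reuse of cached query results for the current session
ALTER SESSION SET USE_CACHED_RESULT = FALSE;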
2.4. Data clustering limitations: Time-based data is loaded by Snowflake in natural ingestion
order and helps gain performance benefits, by eliminating unnecessary reads through the
combination of automatically creating micro-partitions to hold data in them and
automatically capturing statistics, without any further action required from the user. Such
behavior is not guaranteed to happen if you are loading your data in a random sequence or
using multiple parallel load processes.
Oftentimes, the natural ingestion order is not the optimal physical ordering of data for
customer workload patterns. The user can define a set of key(s) to create a clustered table
where the underlying data is physically stored in the order of a user-defined set of key(s).
Selecting proper clustering keys is critical and requires an in-depth understanding of the
common workloads and access patterns against the table in question. Once the user selects
the proper keys, he can benefit from performance gains through the Snowflake automatic
clustering service.
Snowflake automatic clustering comes with some limitations. Examples include cost-
ineffectiveness for tables that change frequently and clustering jobs that are not always smart.
Snowflake is still improving and optimizing the automatic clustering service but did not
publish the related roadmap.
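As a hedged sketch of the clustering mechanics described above (the table and column names are hypothetical):

-- Define clustering keys that match the most common filter columns
ALTER TABLE sales CLUSTER BY (sale_date, region);

-- Check how well the table is currently clustered on those keys
SELECT SYSTEM$CLUSTERING_INFORMATION('sales', '(sale_date, region)');

-- Automatic Clustering can be paused and resumed to control its cost
ALTER TABLE sales SUSPEND RECLUSTER;
ALTER TABLE sales RESUME RECLUSTER;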
2.5. Vertical scaling is a manual process: Vertical scaling or scaling up and down by resizing a
virtual warehouse is a manual process. It is also a common misconception to think that the
only solution available to improve query performance is to scale up to a bigger warehouse!
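For reference, a resize is a single statement, but someone (or some automation you build) still has to issue it, and remember to scale back down; a sketch with a hypothetical warehouse name:

-- Scale up before a heavy workload...
ALTER WAREHOUSE my_wh SET WAREHOUSE_SIZE = 'LARGE';

-- ...and scale back down afterwards to avoid burning credits
ALTER WAREHOUSE my_wh SET WAREHOUSE_SIZE = 'SMALL';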
2.6. Initial poor performance of ad-hoc queries: When there is a need to answer questions not
already solved with predetermined or predefined queries and datasets, users create Ad-hoc
queries. For example, analysts might write ad-hoc queries for immediate data exploration
needs that tend to be heavy. Most of the time, analysts might not know what virtual
warehouse size to use, how to tune these queries, or whether it does make sense to tune them.
Related best practices would be to:
 Isolate such ad-hoc queries by using a separate virtual warehouse to prevent
them from affecting the performance of other workloads.
 Run such ad-hoc queries on bigger warehouses! They will end up running faster
and the cost would be the same compared to running slower in smaller
warehouses.
 Use Snowflake Query Acceleration Service (QAS), a feature built into all
Snowflake Virtual Warehouses, to improve Ad-hoc query performance by
offloading large table scans to the QAS service.
2.7. No exposure of functionality like query optimizer hints: Snowflake does not expose
functionality like query optimizer hints that are found in other databases and data
warehouses to control the order in which joins are performed for example.
In some situations, the Snowflake optimizer cannot identify the join ordering that would result in the fastest execution. You might need to rewrite your query using an approach that guarantees the joins are executed in your preferred order. See the Snowflake Knowledge Base article published on May 28, 2019 and titled How To: Control Join Order.
This is an answer from the Snowflake product team that I am posting as is: “This is not a
limitation but a design of Snowflake, to avoid common problems associated with join order
hints such as the query (including joining many tables) are going to have their join order
forced and this render the query very brittle and fragile. if the underlying data changes in the
future, you could be forcing multiple inefficient join orders. Your query that you tuned with
join order could go from running in seconds to minutes or hours.”
Update: Join elimination is a new feature in Snowflake that takes advantage of foreign key
constraints. In a way, this is the first optimizer hint in Snowflake! When two tables are joined
on a column, you have the option to annotate those columns with primary & foreign key
constraints using the RELY property. Setting this property tells the optimizer to check the
relationship between the tables during query planning. If the join isn’t needed, it will be
removed entirely.
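A minimal sketch of annotating the join keys with RELY (table and column names are illustrative):
ALTER TABLE dim_customer ADD CONSTRAINT pk_customer PRIMARY KEY (customer_id) RELY;
ALTER TABLE fact_orders ADD CONSTRAINT fk_orders_customer
  FOREIGN KEY (customer_id) REFERENCES dim_customer (customer_id) RELY;
With these constraints in place, a join from fact_orders to dim_customer that does not actually use any dim_customer columns can be removed by the optimizer.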
2.8. Rigid virtual warehouse sizing leads to computing waste and overspending: Snowflake
virtual warehouses come in fixed sizes that must be manually scaled to the next instance
doubling size and cost to match query complexity. For example, a Large virtual warehouse ( 8
nodes) would need to go to an X-Large virtual warehouse (16 nodes), even if meeting current
query complexity demands would require only one more node. Relying on fixed warehouse
sizing that doubles warehouse sizes and costs every time a little more query performance is
needed leads to computing waste and overspending.
2.9. Missing out on performance optimizations from Open File Formats: Snowflake users
might miss out on performance optimizations that common Open file formats such as Apache
Arrow, Apache Parquet, Apache Avro, and Apache ORC offer. Snowflake can import data from these open formats into its proprietary file format (FDN: Flocon De Neige) but cannot work with these open file formats directly.
With the announcement of the new Snowflake feature called Iceberg table format, Snowflake
is adding support for the Apache Parquet file format. Iceberg Tables are in a private preview
as of the publishing date of this article and are not publicly available to Snowflake customers
yet. Snowflake announced on January 21, 2022, expanded support for Iceberg via External
Tables. At the Snowflake Summit on June 14, 2022, Snowflake announced a new type of
Snowflake table called Iceberg Tables: “In this 6 minutes demo, Snowflake Software Engineer
Polita Paulus shows you how a new type of Snowflake table, called an Iceberg Table, extends
the features of Snowflake’s platform to Open formats, Apache Iceberg and Apache Parquet, in
storage managed by customers. You can work with Iceberg Tables as you would with any
Snowflake table, including being able to apply native column-level security, without losing the
interoperability that an open table format provides.”
You might wonder what the difference is between Snowflake's support of Apache Iceberg via External Tables and via Iceberg Tables: external tables are read-only, while Iceberg tables allow read, insert, and update.
2.10. No uniqueness enforcement: There is no way to enforce uniqueness in inserted data. If
you have a distributed system and it writes data on Snowflake, you will have to handle the
uniqueness yourself either on the application layer or by using some method of data de-
duplication.
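A common de-duplication sketch on the consumption side, assuming an event_id key and a load_ts column (both illustrative), keeps only the latest copy of each row:
SELECT *
FROM raw_events
QUALIFY ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY load_ts DESC) = 1;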
Snowflake announced Unistore, not yet in public preview, which introduces a new Hybrid Table type that allows Unique, Primary, and Foreign Key constraints. Here's a 6-minute YouTube demo.
2.11. No full separation of storage and compute

“Since end-to-end query performance depends on both cache hit rate for persistent data files
and I/O throughput for intermediate data, it is important to optimize how the ephemeral
storage system splits capacity between the two. Although we currently use the simple policy of
always prioritizing intermediate data, it may not be the optimal policy with respect to end-to-
end performance objectives.” Excerpt from this presentation and related paper titled ‘Building
An Elastic Query Engine on Disaggregated Storage’ and published in February 2020. PDF (15
pages), Video (19' 57")
2.12. Heavy scans: Before copying data, Snowflake checks whether the staged files have already been loaded, and this check slows down loads when a stage contains a very large number of files. To avoid scanning terabytes of files that have already been loaded, partition your staged data files by path; this helps maximize your load performance.
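For example, if staged files are organized by date, loading only a specific path avoids rescanning the whole stage (stage name, path, and file format are illustrative):
COPY INTO sales
  FROM @sales_stage/2022/09/01/
  FILE_FORMAT = (TYPE = 'CSV');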
2.13. Out-of-the-box concurrent write Limits: Snowflake has a built-in limit of 20 DML
statements that target the same table concurrently, including COPY, INSERT, MERGE,
UPDATE, and DELETE. Snowflake users might not be aware of such a concurrent write limit
as this is not in Snowflake documentation. The root cause seems to be related to constraints
in the Global Service Layer’s metadata repository database. You need to be aware that
Snowflake is designed for high-volume high-concurrency reading and not writing.
To overcome such a performance challenge of concurrent write limits in Snowflake, you need
to architect around it as explained in this article published on July 26, 2019.
2.14. Data Latency: Latency issues in Snowflake can have many causes; one example is Snowflake and other data services not being in the same cloud region. To avoid such latency issues, run Snowflake on the same public cloud and in the same (or a closer) cloud region as your data, users, and other services.
2.15. Inefficient natural partitioning as the data grows: By default, Snowflake auto clusters
your rows according to their insertion order. Because insertion order is often correlated with
dates, and dates are a popular filtering condition, this default clustering works quite well in
many cases. However, when you have too many DML statements after your initial data load,
your clustering may lose its effectiveness over time: more overlap means fewer pruning
opportunities. Using the SYSTEM$CLUSTERING_DEPTH system function helps you see how well your micro-partitions are clustered for a specified set of columns. It returns the
average depth as an indicator of how much your micro-partitions overlap. The lower this
number, the less overlap, the more pruning opportunities, and the better your query
performance when filtering on those columns.
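A quick sketch of checking this, assuming a table named ORDERS that is usually filtered on O_ORDERDATE:
SELECT SYSTEM$CLUSTERING_DEPTH('orders', '(o_orderdate)');
SELECT SYSTEM$CLUSTERING_INFORMATION('orders', '(o_orderdate)');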
2.16. Shared Cloud Services layer performance degradation: Snowflake shared Cloud Service
Layer in a particular region can become overloaded.
This is a real-world interaction with Snowflake Customer Service with their analysis of why
our metadata queries either were extremely slow or timed out: “Our team has reviewed the
information provided. We found on September 1, 2022 there was an issue with our Cloud
Services servers encountering heavier than normal usage. As a result, some queries had
timeouts reading metadata resulting in incidents. This issue with the servers was identified by
the Snowflake team during the event and steps were taken to resolve the issue. We apologize
for the inconvenience. Thanks”. A potential solution is for Snowflake to predictively scale
Cloud Services clusters prior to hitting any resource limits.
In addition, an outage of the shared Cloud Services will affect multiple customers. Getting
technical support will become a big challenge.
2.17. No Snowflake query workload manager: Unlike other data warehouses, Snowflake lacks
a resource manager to assign resources to queries based on their importance and overall
system workload.
2.18. Significant delay in getting large output Resultset in Snowflake UI: You might
experience query slowness due to fetching a large output resultset.
You can set this parameter UI_QUERY_RESULT_FORMAT to ARROW at the
session/account/user level and test if you are getting the results faster:
ALTER SESSION SET UI_QUERY_RESULT_FORMAT = 'ARROW';
This post, How To: Resolve Query slowness due to large output published on March 28, 2022,
provided the solution to avoid query slowness due to fetching large output resultset.
3. Forced spending caused by built-in Snowflake service offerings
The use of performance-related additional services from Snowflake forces a cost increase for
users and challenges them to strike a balance between performance and cost:
3.1. Automatic Clustering Service: Reorganize table data to align with query patterns.
Clustering in Snowflake relates to colocating rows with other similar rows in a micro
partition. Data that is well clustered can be queried faster and more affordably due to
partition pruning that allows Snowflake to skip data that does not pertain to the query based
on statistics of the micro partitions. When clustering is defined, the Automatic Clustering
service, in the background, rewrites micro-partitions to group rows with similar values for the
clustering columns in the same micro-partition.
Snowflake users might make clustering choices that yield no performance gains and simply waste spend:
DO NOT: Cluster by a raw timestamp when most predicates filter by DAY (see the sketch after this list).
DO NOT: Cluster on a VARCHAR field that has a low cardinality prefix.
DO NOT: Try to compensate for starting character cardinality with a HASH function.
DO NOT: Change clustering in production on a table without a manual rewrite.
This quick write-up Tips for Clustering in Snowflake will help you understand what clustering
is on Snowflake, Why it is important, How to pick cluster keys, and what not to do.
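For instance, instead of clustering on a raw timestamp (the first DO NOT above), clustering on the day keeps the key aligned with day-level predicates; the table and column names here are illustrative:
ALTER TABLE events CLUSTER BY (TO_DATE(event_ts));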
3.2. Materialized Views: To boost query performance for workloads that contain a lot of the
same queries, you might create materialized views to store frequently used projections and
aggregations. Although the results from related materialized views queries are guaranteed to
be up-to-date, there are extra costs associated with those materialized views. As a result,
before you create any materialized views, think about whether the expenses will be
compensated by the savings from reusing the results frequently enough.
3.3. Search Optimization Service: To quickly find the needle in the haystack to return a small
number of rows on a large table, you might enable search optimization on that table. The
maintenance service begins constructing the table’s search access paths in the background and might massively parallelize the related jobs, which can result in a rapid increase in spending. Before enabling search optimization on a large table, you can get an estimate of the spending using SYSTEM$ESTIMATE_SEARCH_OPTIMIZATION_COSTS so you know what you’re in for.
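A minimal sketch, assuming a large table named BIG_TABLE:
SELECT SYSTEM$ESTIMATE_SEARCH_OPTIMIZATION_COSTS('big_table');
-- only after reviewing the estimate
ALTER TABLE big_table ADD SEARCH OPTIMIZATION;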
3.4. Query Acceleration Service: Elastic scale out of computing resources without changing
the size of virtual warehouses. See What is Snowflake Query Acceleration Service?
4. Not enough performance tuning tools and services
Despite already celebrating its 10th anniversary, Snowflake is considered relatively new and
evolving fast compared to legacy data warehouses and platforms that have matured over
decades and provided fine-grained tuning tools that DBAs typically use.
Although Snowflake features for investigating slow performance, such as the query profile, the query history page, and the warehouse loading charts, offer valuable data and insights, there is a need for more tuning tools and services. For example, additional tools that might be helpful include a Clustered Tables Explorer & Analyzer, a Warehouse Idle Time Alerter, …
5. Overall Performance degradation due to data, users, and query complexity
growth
Queries might start seeing performance issues as underlying data grows or the number of
users querying data increases. Although performance degradation is not specific to
Snowflake, users find it difficult to fix in Snowflake with limited tuning options. They lack features they are used to from other data platforms, such as indexes and workload managers, and the tools they would normally use to investigate and fix performance issues.
6. Performance limitations due to cloud generic hardware infrastructure
Snowflake runs on generic and general-purpose hardware infrastructure available in AWS,
GCP, or Azure. Such cloud hardware limits Snowflake’s performance as it is not optimized for
data warehousing workloads and other workloads that Snowflake supports.
In addition, unlike some competitors, Snowflake does not offer performance tuning through hardware selection and does not disclose its hardware specs. This lack of transparency limits customization of Snowflake performance through hardware to simply selecting a virtual warehouse size, without knowing the underlying details or having any other hardware-related choices.
At its 2022 Summit, Snowflake announced ‘Transparent engine updates’: “For those of you
running on AWS, you will get faster performance for all of your workloads. We’ve optimized
Snowflake to take advantage of new hardware improvements offered by AWS, and we are
seeing 10% faster compute on average in the regions already rolled out. No user intervention
or choosing a particular configuration is required for this latest performance enhancement.”
See blog post published on July 27, 2022 and titled Snowflake’s New Engine and Platform
Announcements
At its 2022 Summit, Snowflake also announced large-memory instances for ML workloads to enable training models inside Snowflake. See the demo titled ‘Train And Deploy Machine Learning Models With Snowpark For Python’, published on June 14, 2022.
7. Dependency on cloud providers and risk of services downtime
Snowflake depends on many services provided by cloud providers such as AWS, GCP, and
Azure. When such services suffer a performance degradation or go down, this directly affects
your Snowflake account.
Downtime is a disadvantage of cloud computing as cloud services, including Snowflake, are
not immune to such outages or slowdowns. That is an unfortunate possibility and can occur
for any reason. Your business needs to assess the impacts of an outage, slowdown, or any
planned downtime from Snowflake, follow best practices for minimizing impact or slowdowns
and outages, and implement solutions such as Business Continuity and Disaster Recovery
plans. You can subscribe to Snowflake status updates here.
8. Limitations in the Query Profile
Snowflake offers Query Profile to analyze queries. In Snowflake SnowSight UI, in the Query
Profile view, there is a section called Profile Overview where you can see the breakdown of the
total execution time. It contains statistics like Processing, Local Disk I/O, Remote Disk I/O,
Synchronization etc. At this time, there is no way in Snowflake to access those statistics
programmatically instead of having to navigate to that section for each query that you want to
analyze. This is a frequent request from customers and it might get offered one day!
Meanwhile, a couple of workarounds exist. You can use the open source project ‘Snowflake Snowsight Extensions’, which enables manipulation of Snowsight features from the command line. You can also wait for GET_QUERY_STATS, a system function that returns statistics about the execution of one or more queries (the same statistics shown in the query profile tab in Snowsight) via a programmatic interface; it is still in private preview as of the date of publication of this article.
Update: GET_QUERY_OPERATOR_STATS(), in public preview available to all accounts, is a
system function that returns statistics about individual query operators within a query. You
can run this function for any query that was executed in the past 14 days. Such statistics,
available in the query profile tab in Snowsight, are now available via a programmatic
interface. See blog from Snowflake Inc. published on December 20, 2022 and titled Analyze
Your Query Performance Like Never Before with Programmatic Access to Query Profile.
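For example, to pull operator-level statistics for the most recent query in your session:
SELECT *
FROM TABLE(GET_QUERY_OPERATOR_STATS(LAST_QUERY_ID()));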
Update: ‘Programmatic Access to Query Profile Statistics, in public preview soon, which will
let customers analyze long-running and time-consuming queries more easily. This view will
also identify (and let users resolve) performance problems such as exploding joins before they
impact the end user.’ See blog from Snowflake published on November 8th, 2022: A Faster,
More Efficient Snowflake Takes the Stage at Snowday
9. Lack of documentation of error codes and messages
Snowflake might display cryptic error messages that are not available in Snowflake
documentation. You need to contact Snowflake technical support or dig into the source code of connectors and drivers in GitHub repositories, which does not cover all parts of the Snowflake data platform. Although Snowflake has been made aware of this issue, it does not have any plan for adding related documentation at this time!
III. Snowflake Ecosystem (3)
1. Inefficient integration with Snowflake
Code generated by third-party tools to connect and integrate to Snowflake could be inefficient
from a performance perspective. For example, metadata-related queries generated as introspection code can be extremely slow.
2. Best practices knowledge required by third-party tools and services
Almost every third-party tool and service from the Snowflake ecosystem comes with an
exhaustive list of best practices to follow! The burden is on Snowflake users to be aware of
such best practices and implement them when integrating with Snowflake.
3. Piecemeal information about Snowflake performance optimization and
tuning
Some information about Snowflake’s performance, available piecemeal here and there, lacks
structure, needs consolidation, and might be outdated or misleading.
IV. Snowflake Users (3)
1. Snowflake users need to get familiar with Snowflake’s ways of dealing
with performance
Users familiar with their legacy systems and migrating to Snowflake using a lift-and-shift approach might encounter performance issues in their Snowflake account. Some of the old ways they are used to with their legacy systems, such as indexes, won’t work in Snowflake! They would need to tune their lift-and-shift approach and take a fresh look at their performance problems.
2. Snowflake users take shortcuts for solving performance issues
Snowflake makes it incredibly easy for users to try and fix slow workloads by simply throwing
more compute resources at them. Snowflake charges for services such as automated
clustering that help optimize performance when used properly.
Instead of identifying the root causes of performance problems, following known Snowflake performance best practices, and avoiding anti-patterns, Snowflake users take shortcuts. To solve some of their performance problems, they opt for brute force and the ease of use of features such as scaling up, scaling out, and automatic clustering, at the cost of unnecessary compute spend!
3. Snowflake users lack the knowledge about performance optimization and
tuning
Snowflake users are a broad mix of business and technical users with varying levels of
Snowflake proficiency. As a result, they might have inefficient queries and processes of data
ingestion, transformation, and consumption.
4. Snowflake users face a steep learning curve of Snowflake performance
optimization and tuning
Like any new change or experience, it can be jarring to transition to Snowflake and learn more
about its performance optimization and tuning. In addition, as Snowflake keeps quickly
evolving and adding new features and functionality, you need to keep yourself up-to-date to
get the most out of your Snowflake account. Snowflake does not remove the need for highly
skilled technical resources to operate and tune a Snowflake account.
Here are some examples of Snowflake performance aspects you need to keep learning about
and related challenges you need to solve:
4.1. Long-running queries: You might need to seek and destroy long-running queries that are the result of mistakes, eat up compute resources, and add no value! The blog posted on September 5, 2022 and titled Snowflake: Long Running Queries Seek & Destroy shows an example of a long-running query and two ways to solve this problem. Using QUERY_HISTORY, you can accurately identify long-running queries by looking at their EXECUTION_TIME. Note that TOTAL_ELAPSED_TIME also includes the time the query spent sitting in a queue while the warehouse was being provisioned or was overloaded. See the blog published on January 10, 2022 and titled Long Running Queries in Snowflake: QUERY_HISTORY function.
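A sketch of such a query using the INFORMATION_SCHEMA table function (the 5-minute threshold is illustrative):
SELECT query_id, query_text, warehouse_name,
       execution_time / 1000 AS execution_seconds,
       total_elapsed_time / 1000 AS elapsed_seconds
FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY(RESULT_LIMIT => 10000))
WHERE execution_time > 5 * 60 * 1000
ORDER BY execution_time DESC;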
4.2. Complex queries: If a query is spending more time compiling compared to executing,
perhaps it is time to review the complexity of the query. See article Understanding Why
Compilation Time in Snowflake Can Be Higher than Execution Time
4.3. Inefficient queries: Although tuning and optimizing SQL queries is a big topic, some
known guardrails could help you pick some low-hanging fruits of performance improvement.
Examples include
 Select *: When fetching required attributes, avoid using the SELECT * statement, as it conveys all the attributes from storage to the warehouse cache. This slows down the query and fills the warehouse cache with unwanted data. Snowflake is a columnar data store: explicitly write only the columns you need, or specify the columns that should be excluded from the results using SELECT * EXCLUDE (col_name, col_name, …). See Selecting All Columns Except Two or More Columns.
 Exploding joins: This happens when joining tables without providing a join
condition (resulting in a “Cartesian product”), or providing a condition where
records from one table match multiple records from another table. For such
queries, the Join operator produces significantly (often by orders of magnitude)
more tuples than it consumes. This is a common mistake to avoid when writing
code to join tables and to identify using the Query Profile. See related Snowflake
documentation “Exploding” Joins and the Snowflake Knowledge Base article
published on January 15, 2019 and titled How To: Recognize Row Explosion
 UNION without ALL: In SQL, it is possible to combine two sets of data with
either UNION or UNION ALL constructs. The difference between them is that
UNION ALL simply concatenates inputs, while UNION does the same, but also
performs duplicate elimination. A common mistake is to use UNION when the
UNION ALL semantics are sufficient. These queries show in Query Profile as a
UnionAll operator with an extra Aggregate operator on top (which performs
duplicate elimination). The best practice is to use UNION ALL instead of UNION when deduplication is not required (or when the rows being combined are already known to be distinct); see the sketch after this list and the related Snowflake documentation UNION Without ALL.
 Not using ANSI join: “There can be potential performance differences between
non-ANSI and ANSI join syntax as the parser does not currently
transform/rewrite the join using the ANSI syntax after parsing. As a result, Non-
ANSI syntax doesn’t benefit from the subsequent steps of transformation and
optimization by the compiler that eventually generate a good plan.” See
Snowflake Knowledge Base article, published on June 18, 2022, and
titled Performance Implications of using Non-ANSI syntax versus ANSI join
syntax.
 Using String instead of Date or Timestamp data type: Date/Time Data Types for
Columns: “When defining columns to contain dates or timestamps, Snowflake
recommends choosing a date or timestamp data type rather than a character data
type. Snowflake stores DATE and TIMESTAMP data more efficiently than
VARCHAR, resulting in better query performance. Choose an appropriate date or
timestamp data type, depending on the level of granularity required.” See Snowflake Table Design considerations.
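A couple of the guardrails above as a minimal sketch (table and column names are illustrative):
-- list only the columns you need, or exclude the ones you do not
SELECT order_id, customer_id, order_date FROM orders;
SELECT * EXCLUDE (raw_payload, audit_comment) FROM orders;
-- prefer UNION ALL when deduplication is not required
SELECT order_id FROM orders_2021
UNION ALL
SELECT order_id FROM orders_2022;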
People might learn SQL on the job and make their fair share of mistakes, including queries similar to the ones above. They can learn to write SQL more efficiently by taking an online SQL class or a training that focuses on SQL for Snowflake.
4.4. Storage Spillage: For some operations (e.g. duplicate elimination for a huge data set), the
amount of memory available for the compute resources used to execute the operation might
not be sufficient to hold intermediate results. As a result, the query processing engine will
start spilling the data to the local disk. If the local disk space is not sufficient, the spilled data
is then saved to remote disks. Disk drives are a lot slower than RAM, so this spilling can have a profound effect on query performance, especially if a remote disk is used for spilling. By using a larger warehouse, you also get more RAM, so the query or load can complete enough faster that the savings outweigh the extra cost of the larger size. See the Snowflake Knowledge Base article published on February 5, 2019 and titled How To: Recognize Disk Spilling.
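To find the heaviest spillers over the last week, a sketch against the ACCOUNT_USAGE view (which has some ingestion latency) could be:
SELECT query_id, warehouse_name, warehouse_size,
       bytes_spilled_to_local_storage,
       bytes_spilled_to_remote_storage
FROM snowflake.account_usage.query_history
WHERE start_time > DATEADD('day', -7, CURRENT_TIMESTAMP())
  AND bytes_spilled_to_remote_storage > 0
ORDER BY bytes_spilled_to_remote_storage DESC
LIMIT 20;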
4.5. Concurrent queries: If a user experiences a performance issue for a query, one should
find out the queries running concurrently with a problematic query in the same warehouse.
Concurrent queries utilizing the same virtual warehouse can cause the sharing of resources
such as CPU, Memory, and other key resources. This may potentially impact the response
time for concurrent queries.
See the article from Snowflake Knowledge Base How To: Query to find concurrent queries
running on the same warehouse, published on July 27, 2020.
4.6. Outlier queries: Examples of outlier queries include those that use more resources than other queries in the same Snowflake virtual warehouse. The Snowflake Query Acceleration Service, in public preview as of the writing date of this article, can help here. You can add query acceleration via Snowsight when creating a compute cluster to make complex queries run faster by injecting additional CPU horsepower from Snowflake serverless pools, while the remaining simpler queries execute using only the regular nodes in your cluster. There is no need to increase the size of the entire cluster and waste money just to speed up some of the more complex queries when most other queries do not need that extra CPU power.
4.7. Queued queries: See the Snowflake Knowledge Base article published on September 9, 2020, How To: Understand Queuing, which explains the different types of queuing and what to do when they occur.
4.8. Blocked queries: A blocked query is attempting to acquire a lock on a table or partition
that is already locked by another transaction. Account administrators (ACCOUNTADMIN role) can view all locks, transactions, and sessions with: SHOW LOCKS [IN ACCOUNT]. For
all other roles, the function only shows locks across all sessions for the current user.
References:
 Snowflake Knowledge Base article titled How To: Resolve blocked queries and
published on May 18, 2017.
 The blog ‘Transaction Locks in Snowflake’ by Sachin Mittal, published on March 15, 2022.
4.9. Queries against the metadata layer: A query might be taking more time on metadata operations. The Information Schema views are optimized for queries that retrieve a small
subset of objects from the dictionary. Whenever possible, maximize the performance of your
queries by filtering on schema and object names. For more usage information and details, see
the Snowflake Information Schema related documentation.
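For example, filtering on both the schema and the object name keeps a metadata query selective (database, schema, and table names are illustrative):
SELECT table_name, row_count, bytes
FROM my_db.information_schema.tables
WHERE table_schema = 'SALES'
  AND table_name LIKE 'ORDERS%';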
4.10. Inefficient Pruning: ‘Snowflake collects rich statistics on data allowing it not to read
unnecessary parts of a table based on the query filters. However, for this to have an effect, the
data storage order needs to be correlated with the query filter attributes. The efficiency of
pruning can be observed by comparing Partitions scanned and Partitions total statistics in the
TableScan operators. If the former is a small fraction of the latter, pruning is efficient. If not,
the pruning did not have an effect. Of course, pruning can only help queries that filter out a
significant amount of data. If the pruning statistics do not show data reduction, but there is a
Filter operator above TableScan which filters out a number of records, this might signal that a
different data organization might be beneficial for this query.’ See related Snowflake
documentation on Inefficient pruning and Understanding Snowflake Table Structures. See
also Snowflake Knowledge Base article published on January 24, 2019 and titled How To:
Recognize Unsatisfactory Pruning
4.11. Inefficient data model: An inefficient data model can negatively impact your Snowflake
performance. Often overlooked, optimizing your data model will improve your Snowflake
performance. Snowflake supports various modeling techniques such as Star, Snowflake, Data
Vault and BEAM. Let your usage patterns drive your data model design. Think about how you
foresee your data consumers and business applications leveraging data assets in Snowflake.
4.12 Connected clients issues: You might be connecting to Snowflake using clients that are out
of Snowflake support and not benefiting from fixes, performance and security enhancements,
and new features. You will need to periodically check the versions of your connectors to
Snowflake and upgrade. Snowflake sends a quarterly email on behalf of Support Notifications
titled ‘Quarterly Announcement — End of Support Snowflake Client Drivers: Please Upgrade!’
4.13. Lack of sub-query pruning: Pruning does not happen when the filter value comes from a subquery. See this Snowflake Knowledge Base article.
4.14. Queries not leveraging results cache: When a query is executed, the result is persisted
(i.e. cached) for a period of time. At the end of the time period, the result is purged from the
system. For persisted query results of all sizes, the cache expires after 24 hours. If a user
repeats a query that has already been run, and the data in the table(s) hasn’t changed since
the last time that the query was run, then the result of the query is the same. Instead of
running the query again, Snowflake simply returns the same result that it returned previously.
This can substantially reduce query time because Snowflake bypasses query execution and,
instead, retrieves the result directly from the cache. To learn more, check Using Persisted Query Results.
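Result reuse is controlled by the USE_CACHED_RESULT parameter; a small sketch for verifying cache behaviour while benchmarking:
ALTER SESSION SET USE_CACHED_RESULT = FALSE;  -- force re-execution during tests
ALTER SESSION SET USE_CACHED_RESULT = TRUE;   -- restore the default behaviour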
4.15. Queries doing a point lookup: You might need to improve the performance of selective point-lookup queries that return only one or a small number of distinct rows from a large table. The Search Optimization Service (SOS) is a feature that optimizes all supported columns within a table. It is a table-level property; once added to a table, a maintenance service creates and populates search access paths that are used to perform lookups.
4.16. Queries against very large tables: You might have a very large table where most of your queries select only the same few columns, apply roughly the same filters, and aggregate on the same columns. Creating a Materialized View (MV) with these repeating patterns can greatly help these queries: the Materialized View already holds the resulting data of these repeating patterns, ready for retrieval, rather than computing them over and over against the table.
Over time, as the data changes in a very large table as a result of DML transactions, the
distribution of the data becomes more and more disorganized. Automatic Clustering enables
users to designate a column or a set of columns as the Clustering Key. Snowflake uses this
Clustering Key to reorganize the data so that related records are relocated in the same micro-
partitions enabling more efficient partition pruning. Once a Clustering Key is defined, the
table will be reclustered automatically to maintain optimal data distribution. Automatic Clustering will then help with queries against large tables that use range or equality filters on the Clustering Key.
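A hedged sketch of both options, with all table, view, and column names being illustrative:
CREATE MATERIALIZED VIEW daily_sales_mv AS
SELECT order_date, region, SUM(amount) AS total_amount
FROM sales
GROUP BY order_date, region;
ALTER TABLE sales CLUSTER BY (order_date);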
4.17. Latency issues
4.18. Heavy Scanning
4.19. Slow data load speed: It can be due to many reasons, such as not partitioning staged
data files to avoid scanning files that have already been loaded or not breaking up single large
files into the appropriate size to benefit from Snowflake’s automatic parallel execution.
4.20. Slow ingestion of streaming data
What else would you like to add to the above Snowflake performance challenges? I would much appreciate your comments and feedback.
Please, spread the word, and don’t forget to give this article a ‘Like’ if you find it helpful so
that your LinkedIn buddies can read it and comment on it too. I hope this 2-part blog series
will pave the way to further constructive discussions about Snowflake performance challenges
and solutions for the benefit of all.

Task scheduling and triggered tasks both perform the same function in
Snowflake, but task scheduling can be more resource-intensive and
costly.
Triggered tasks execute when a stream has data. Before triggered tasks, the common pattern was a task scheduled to run every minute with SCHEDULE = '1 minute' and WHEN system$stream_has_data('my_stream_name'). Scheduling tasks in this way often results in tasks running constantly, even when performing no work. Triggered tasks eliminate the need to schedule tasks to run constantly. Support for this feature is currently not in production and is available only to selected accounts. To create a triggered task, omit the SCHEDULE parameter from the task definition, as shown:

CREATE TASK triggeredTask
  WAREHOUSE = my_warehouse
  WHEN system$stream_has_data('my_stream')
AS
  INSERT INTO my_downstream_table
  SELECT * FROM my_stream;

Triggered tasks only execute when changes are made to the table that the associated stream is tracking, and they run shortly after the data change is committed (usually within a few seconds).
Benefits of Triggered Tasks:
Task execution is made simple and automatic through data change triggers. Tasks are
triggered automatically when data is available in the stream, eliminating the need for
schedules. This approach reduces cloud service charges and load, freeing up resources for
other tasks that perform useful work instead of simply polling for work.
Faster triggered tasks can execute as often as every 15 seconds by modifying the USER_TASK_MINIMUM_TRIGGER_INTERVAL_IN_SECONDS parameter; by default, triggered tasks execute at most every 30 seconds. In addition to data-change triggers, a triggered task also runs:
 Once when the task is first resumed, to consume any data already in the stream (for example, a stream created with SHOW_INITIAL_ROWS = true).
 Once every 12 hours, if not previously triggered during that time period. If there is no data in the stream, the run is skipped and no compute resources are used.
Differences between Triggered Tasks and Scheduled Tasks:
Triggered tasks differ from scheduled tasks in that scheduled tasks specify the SCHEDULE parameter. Triggered tasks generally behave the same way as scheduled tasks, with the following exceptions:
In SHOW TASKS / DESC TASK output, the SCHEDULE property displays NULL for
triggered tasks.

If the stream or table that the stream is tracking is dropped (including CREATE OR
REPLACE, which drops it and creates a new one of the same name), the triggered task will no
longer be able to trigger and will automatically suspend. Essentially the result is the same as if
you ran ALTER TASK <task_name> SUSPEND. After the table and/or stream are re-created,
the user can run ALTER TASK <task_name> RESUME to resume triggered processing.
WHEN conditions such as NOT stream_has_data('my_stream') and stream_has_data('my_stream') = false are not allowed; triggered tasks only fire when data changes, not when data stays the same. AND/OR conditions are allowed; for example, stream_has_data('stream') OR stream_has_data('stream2') will trigger when either of the streams has data. With an AND condition, a run can be skipped if only one of the streams has data.

Please follow the steps below for the demo:


Create a master table or target table:
CREATE OR REPLACE TABLE TASK_TRIGGER_MASTER_TABLE (ID INT, NAME VARCHAR, ADDRESS VARCHAR, DATA_ENTRY_TIME TIMESTAMP_NTZ);
Create a stage table or raw table:
CREATE OR REPLACE TRANSIENT TABLE TASK_MASTER_STAGING (ID INT, NAME VARCHAR, ADDRESS VARCHAR, DATA_ENTRY_TIME TIMESTAMP_NTZ);
Create a stream on top of the staging/raw table:
CREATE OR REPLACE STREAM TASK_MASTER_STREAM ON TABLE TASK_MASTER_STAGING APPEND_ONLY = TRUE;
First, create the task with a schedule:
CREATE OR REPLACE TASK triggeredTask
  SCHEDULE = '5 minute'
  WAREHOUSE = compute_wh
  WHEN system$stream_has_data('TASK_MASTER_STREAM')
AS
  INSERT INTO TASK_TRIGGER_MASTER_TABLE SELECT * FROM TASK_MASTER_STAGING;

Check whether the task was created:
SHOW TASKS LIKE 'triggeredTask%';
Next, suspend the task if it is running, remove the schedule so it becomes a triggered task, and resume it:
ALTER TASK triggeredTask SUSPEND;
ALTER TASK triggeredTask UNSET SCHEDULE;
ALTER TASK triggeredTask RESUME;
Then insert a test row into the staging table:
INSERT INTO TASK_MASTER_STAGING VALUES (2, 'Amulya', 'ANANTAPUR', CURRENT_DATE());
Within about 30 seconds, the data is available in the master table:
SELECT * FROM TASK_TRIGGER_MASTER_TABLE;
SELECT * FROM TASK_MASTER_STAGING;
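To confirm that the triggered task actually ran, you can also check task history (a sketch using the INFORMATION_SCHEMA table function):
SELECT name, state, scheduled_time, completed_time
FROM TABLE(INFORMATION_SCHEMA.TASK_HISTORY(TASK_NAME => 'TRIGGEREDTASK'))
ORDER BY scheduled_time DESC;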
Triggered Task Limitations:
Serverless tasks (tasks that don’t specify a WAREHOUSE) are not supported.
Streams on views, directory tables, and external tables are not supported.
Replication is not supported.
Thanks for taking the time to read this blog. Let me know if you have any questions or
comments!
If you require a complete code, please send me a message and I will assist you. Let me know
how I can be of help.
Data Migration Developer: Interview Questions and Answers

Suresh Reddy · Aug 13, 2023 · 5 min read

Data migration is a critical process for any organization looking to upgrade its systems or
adopt new technologies. However, it can also be a daunting and complex task that requires
expertise and experience to ensure a smooth transition. As a data migration consultant with
over 8 years of expertise in the field, I have faced numerous challenges and learned valuable
insights along the way.

In this post, I’m going to share some common questions I’ve come across for data migration
consultant interviews to give you an edge. Let’s get started!
Question: Can you recall a tough data migration project you tackled and how you overcame
its challenges?
Answer: Yes, I once worked on a project where we had to migrate data from a legacy system
with minimal documentation. This is a very common problem in most of the legacy migration
projects. The key was to first understand the data structure through reverse engineering and
then create a comprehensive migration plan. Please start regular communication with the
original system’s team/business users and rigorous testing helped us ensure a smooth
transition.
Question: Can you differentiate between data migration and data integration?
Answer: Data migration is about moving data, like when we transitioned from an on-
premises Oracle database to Google’s BigQuery. On the other hand, data integration is about
combining data. For instance, I once integrated real-time sales data from Shopify with
inventory data in an S3 bucket using tools like Kafka and Dell Boomi.
Question: How do you ensure data quality during migration? Any tools you recommend?
Answer: Data quality is paramount and my favourite part of data migration projects. For
projects like Banking migration projects, ensuring accuracy was non-negotiable. I often lean
on data profiling and cleansing before migration. Tools like Talend, Informatica Data Quality,
and Alteryx have been my go-to for such tasks.
Question: Which data migration tools are you familiar with, and which one do you favour?
Answer: I’ve had hands-on experience with AWS DMS, Talend, and Informatica
PowerCenter. For a project involving a cloud-to-cloud migration that…

