
What is the Snowflake Data Warehouse?

Jun 3 
Written By John Ryan

(6 minute read)

Snowflake is a cloud-based data warehouse, founded in 2012 by three data warehousing
experts. Just six years later, the company raised a massive $450m venture capital
investment, which valued the company at $3.5 billion. But what is Snowflake, and why
is this data warehouse built entirely for the cloud taking the analytics world by storm?

Although not intended as a Snowflake data warehouse tutorial, this article will explain what
Snowflake is, which platforms it supports, and the key aspects of this ground-breaking
technology.

Snowflake: Multi-Cloud Data Platform


Snowflake was first available on Amazon Web Services (AWS), and is a software-as-a-service
platform to load, analyse and report on massive data volumes. Unlike traditional on-premise
solutions, which require hardware to be deployed (potentially costing millions), Snowflake
is deployed in the cloud within minutes, and is charged by the second using a
pay-as-you-use model.

It is possible to register and create an account within minutes. This includes $400 of free
credit, which is enough to store a terabyte of data and run a small data warehouse for
nearly two weeks, on a system that will support a small team of developers.

In July 2018 Snowflake announced its launch on the Microsoft Azure cloud platform. As this
is essentially the same code base as on AWS, customers have a choice of cloud platform,
which is a significant advantage to large corporates as it enables a multi-cloud
deployment strategy.

How does Snowflake Work?


There are many incredible features built into Snowflake, but the most remarkable is the
ability to spin up an unlimited number of virtual warehouses (each effectively an independent
MPP cluster). This means users can run any number of independent workloads against
the same data without any risk of contention, as illustrated in the diagram below.

In addition, each warehouse can be resized within milliseconds from a single-node
extra-small cluster to a massive 128-node monster. This means users don’t have to put up
with poor performance, as the machine size can be adjusted throughout the day to match the
workload. In one benchmark test, I reduced the time to process 1.3 terabytes of data
from 5 hours to under 3 minutes.
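To make the resizing arithmetic concrete, here is a rough sketch. It assumes that each size step doubles the node count between the single-node extra-small and the 128-node maximum mentioned above, and that run time falls in direct proportion to the number of nodes. The size labels and the perfectly linear scaling are assumptions for illustration, and real workloads rarely scale this cleanly.

```python
# Illustrative only: assumes node count doubles with each size step and
# that run time shrinks in direct proportion to the number of nodes.
SIZES = ["XS", "S", "M", "L", "XL", "2XL", "3XL", "4XL"]
NODES = {size: 2 ** step for step, size in enumerate(SIZES)}  # XS=1 ... 4XL=128


def estimated_minutes(minutes_on_xs: float, size: str) -> float:
    """Estimate run time on a given size under ideal linear scaling."""
    return minutes_on_xs / NODES[size]


if __name__ == "__main__":
    five_hours = 5 * 60
    for size in SIZES:
        minutes = estimated_minutes(five_hours, size)
        print(f"{size:>3}: {NODES[size]:>3} nodes, ~{minutes:.1f} min")
```

Under these assumptions a five-hour job on a single node would take a little over two minutes on 128 nodes. That is in the same range as the benchmark figure quoted above, although the benchmark's actual starting size is not stated.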

Finally, in addition to scaling up for larger data volumes, it’s also possible to automatically
scale out to support massive numbers of users. The diagram below illustrates how the
Snowflake multi-cluster feature automatically scales out and then back in during the day, and
the user is only charged for the time the clusters are actually running.
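The scale-out behaviour can be pictured with a toy model. The sketch below is not Snowflake's actual scaling policy: the capacity per cluster, the cluster limits and the hourly demand figures are all made-up values, chosen only to show how billing follows the number of clusters that are running.

```python
import math
from typing import Iterable

# Toy model, not Snowflake's actual scaling policy: the capacity per cluster
# and the cluster limits below are made-up numbers for illustration.
QUERIES_PER_CLUSTER = 8
MIN_CLUSTERS = 1
MAX_CLUSTERS = 4


def clusters_needed(concurrent_queries: int) -> int:
    """Clusters required to serve the demand, clamped to the configured range."""
    wanted = math.ceil(concurrent_queries / QUERIES_PER_CLUSTER)
    return max(MIN_CLUSTERS, min(MAX_CLUSTERS, wanted))


def billed_cluster_hours(hourly_demand: Iterable[int]) -> int:
    """Sum the clusters running in each hour; stopped clusters cost nothing."""
    return sum(clusters_needed(q) for q in hourly_demand)


if __name__ == "__main__":
    demand = [2, 5, 12, 30, 26, 9, 3]  # concurrent queries in each hour
    print("clusters per hour:", [clusters_needed(q) for q in demand])
    print("cluster-hours billed:", billed_cluster_hours(demand))
    print("if every cluster ran throughout:", MAX_CLUSTERS * len(demand))
```

With this sample demand the model bills 15 cluster-hours, against 28 if all four clusters had been left running for the whole period.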

Is Snowflake an MPP database?

MPP stands for Massively Parallel Processing, and is a database architecture successfully
deployed by Teradata and Netezza. Unlike traditional Symmetric Multi-Processing (SMP)
hardware, which runs a number of CPUs in a single machine, the MPP architecture deploys a
cluster of independently running machines, with data distributed across the system. In
addition to the ability to handle massive data volumes, this means it supports a scale-out
architecture, as additional nodes can be added to the cluster, although this can take from
hours to days to deploy.
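As a minimal sketch of what "data distributed across the system" can mean, the example below assigns rows to nodes by hashing a distribution key. The node count, the use of CRC32 and the sample rows are assumptions for illustration, and this is not any vendor's actual implementation.

```python
import zlib
from collections import defaultdict

# Illustrative sketch of hash distribution in an MPP cluster. The node count,
# the choice of CRC32 and the sample rows are assumptions, not any vendor's
# actual implementation.


def node_for(key: str, node_count: int) -> int:
    """Map a distribution key to a node using a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % node_count


def distribute(rows, key_field: str, node_count: int):
    """Group rows by the node that owns each row's distribution key."""
    placement = defaultdict(list)
    for row in rows:
        placement[node_for(str(row[key_field]), node_count)].append(row)
    return dict(placement)


if __name__ == "__main__":
    sales = [{"customer_id": i, "amount": 10 * i} for i in range(1, 9)]
    placement = distribute(sales, "customer_id", node_count=4)
    for node, rows in sorted(placement.items()):
        print(f"node {node}: customers {[r['customer_id'] for r in rows]}")
```

In a simple modulo scheme like this one, changing `node_count` moves many keys to a different node, so data has to be redistributed. That helps explain why expanding such a cluster can be slow.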

EPP stands for Elastic Parallel Processing, and was pioneered by Snowflake Computing. This
uses a number of independently running MPP clusters connected to a shared data pool. This
architecture has the advantage that new clusters can be started within seconds, to elastically
grow or shrink resources as needed.

What are the three layers of Snowflake architecture?


The diagram below illustrates the layers in the Snowflake service:

1. Cloud Service Layer: This is “the brains” of the operation. It provides connectivity to
the database and handles infrastructure, transaction management, SQL performance
optimisation, security and metadata.

2. Compute Services Layer: This hosts a potentially unlimited number of virtual warehouses,
whereby each warehouse consists of a cluster of database servers which executes SQL
statements. Although the virtual warehouse consists of CPUs, memory and SSD storage,
this is purely a transient storage layer.

3. Cloud Storage Layer: This provides an infinite pool of permanent data storage. All data
is stored in cloud storage and is automatically replicated to three separate data centres,
which provides a built-in layer of disaster recovery.

Three Layers of Snowflake Database Architecture

The layers of the architecture work transparently to service end user SQL queries, although it
is possible to start and suspend virtual warehouses manually.

How much does a Snowflake credit cost?


Snowflake compute resources are charged at a rate of $0.00056 per second for a credit on an
on-demand Standard Edition platform. This works out at around $2.00 per hour for an
extra-small virtual warehouse on AWS Europe. Snowflake only charges for compute time while
the virtual warehouse is running, and charges are applied on a per-second basis after the
first 60 seconds.
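The arithmetic can be checked with a few lines of Python. This sketch uses the per-second rate quoted above and reads "after the first 60 seconds" as a 60-second minimum charge each time a warehouse runs. That reading, and the assumption that an extra-small warehouse consumes one credit per hour, are interpretations of the figures in this article rather than a statement of the price list.

```python
# Rate is the figure quoted in the article for on-demand Standard Edition.
# Treating an extra-small warehouse as one credit per hour, and the first
# 60 seconds as a minimum charge, are assumptions made for this sketch.
PRICE_PER_CREDIT_SECOND = 0.00056  # dollars
MINIMUM_BILLED_SECONDS = 60


def compute_cost(run_seconds: int, credits_per_hour: int = 1) -> float:
    """Dollar cost of one warehouse run, billed per second with a minimum."""
    if run_seconds <= 0:
        return 0.0  # a warehouse that never ran is not billed
    billed_seconds = max(run_seconds, MINIMUM_BILLED_SECONDS)
    return billed_seconds * PRICE_PER_CREDIT_SECOND * credits_per_hour


if __name__ == "__main__":
    print(f"one hour:    ${compute_cost(3600):.3f}")
    print(f"ten seconds: ${compute_cost(10):.4f}")
    print(f"90 seconds:  ${compute_cost(90):.4f}")
```

On these assumptions an hour on an extra-small warehouse comes to about $2.02, and a ten-second run is billed as a full minute, at about $0.03.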

Storage is charged separately as a pass-through cost from the underlying provider, and on
AWS works out at around $23 per terabyte per month. This means it’s possible to store a
10 terabyte data warehouse for around $230 per month. In reality, as Snowflake applies
columnar compression to the data, it’s likely that storage will work out much cheaper on
Snowflake than (for example) S3.
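The storage figure follows the same kind of arithmetic. In the sketch below the $23 per terabyte rate is the one quoted above, while the 3:1 compression ratio is a purely hypothetical value used to show the effect of compression on the bill.

```python
# The $23 per terabyte per month rate is the figure quoted in the article;
# the 3:1 compression ratio used below is hypothetical.
PRICE_PER_TB_MONTH = 23.0  # dollars


def monthly_storage_cost(raw_terabytes: float, compression_ratio: float = 1.0) -> float:
    """Monthly cost of storing data that shrinks by the given compression ratio."""
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    return raw_terabytes / compression_ratio * PRICE_PER_TB_MONTH


if __name__ == "__main__":
    print(f"10 TB uncompressed:  ${monthly_storage_cost(10):.2f}")
    print(f"10 TB at 3:1 (guess): ${monthly_storage_cost(10, 3):.2f}")
```

The actual ratio depends heavily on the data, so the second figure should be read as an example of the calculation and not as an expected saving.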

What SQL does Snowflake use?


Snowflake supports a standard set of SQL, a subset of the ANSI SQL:1999 and SQL:2003
standards. This means most SQL statements which currently execute against Teradata, Netezza,
Oracle or Microsoft will also execute on Snowflake, often with no changes needed. Indeed,
Snowflake includes a number of extensions to ensure SQL can be quickly migrated.

Is Snowflake a Data Lake?


The Data Lake architecture became popular as a method of storing massive data volumes in
their raw form, rather than transforming and loading data into a data warehouse, which
inevitably leads to selectivity and consequent data loss. This architecture was traditionally
deployed on Hadoop platforms as it often includes semi-structured and unstructured data,
which were challenging to handle on traditional relational platforms.

Unlike legacy data warehouses, Snowflake supports both structured and semi-structured data
including JSON, AVRO and Parquet, and these can be directly queried using SQL. Unlike
Hadoop, Snowflake independently scales compute and storage resources, and is therefore a
far more cost-effective platform for a data lake.

As a result, many customers moving to a cloud-based deployment are implementing their
data lake directly in Snowflake, as it provides a single platform to manage, transform and
analyse massive data volumes. The ability to seamlessly combine JSON and structured data
in a single query is a compelling advantage of Snowflake, and avoids operating a different
platform for the Data Lake and Data Warehouse.
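To illustrate the idea of combining JSON and structured data in one query, here is a small runnable stand-in. It uses SQLite's `json_extract` function from Python's standard library, not Snowflake. In Snowflake the JSON would more typically sit in a VARIANT column and be queried with Snowflake's own path syntax, which is not shown here. The table names, columns and sample data are invented for the example.

```python
import json
import sqlite3

# Stand-in demo using SQLite's JSON functions, not Snowflake. Table names,
# columns and sample data are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE raw_events (payload TEXT)")  # raw JSON, loaded as-is

conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Acme"), (2, "Globex")])
events = [
    {"customer_id": 1, "device": {"type": "mobile"}, "spend": 20},
    {"customer_id": 1, "device": {"type": "web"}, "spend": 35},
    {"customer_id": 2, "device": {"type": "web"}, "spend": 50},
]
conn.executemany("INSERT INTO raw_events VALUES (?)", [(json.dumps(e),) for e in events])

# One query joins the relational table to fields pulled out of the JSON.
query = """
    SELECT c.name,
           json_extract(e.payload, '$.device.type') AS device_type,
           SUM(json_extract(e.payload, '$.spend'))  AS total_spend
    FROM raw_events AS e
    JOIN customers AS c
      ON c.customer_id = json_extract(e.payload, '$.customer_id')
    GROUP BY c.name, device_type
    ORDER BY c.name, device_type
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The point of the sketch is that the raw events are stored untransformed and only interpreted at query time, alongside an ordinary relational table.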

In his excellent article, Tripp Smith explains the benefits of the EPP Snowflake architecture,
which can deliver savings of up to 300:1 on storage compared to Hadoop or MPP platforms.

Why was the company called Snowflake?


Despite a long tradition of technology companies having non-tech names (for example Apple,
Google and Amazon), Snowflake was not named by a marketing team. According to
the founders, it was named because of their shared love of snow and skiing.

I was lucky enough to attend a meeting with the founders, where the French-born
founder Thierry Cruanes explained in a full French accent how difficult it was to pronounce
the name of his previous company, Oracle. At least now, he joked, people could understand
“Snowflake”.

Snowflake data warehouse pros and cons

The advantages of cloud-based data warehousing have been extensively reviewed. The main
advantages of Snowflake over traditional on-premise based solutions are:

• Machine Size: Is no longer an issue. Unlike traditional systems, which typically
involve deploying a massive server with plans to upgrade a few years down the line,
Snowflake can be deployed on a single extra-small cluster, and scaled up and down as
needed.
• Disk Space: Is no longer an issue, as data storage from cloud providers is both
inexpensive and practically infinite in size.
• Security: Is baked into the system. Snowflake includes a huge array of security
features including IP whitelisting, multi-factor authentication and strong AES-256
end-to-end encryption.
• Disaster Recovery: Is no longer an issue, as data is automatically replicated across
three availability zones, and can withstand the loss of any two data centres.
• Software Upgrades: Are no longer required. As Snowflake is provided as a
software service, both operating system and database upgrades are silently and
transparently applied.
• Performance: Is no longer an issue, as clusters can be resized on-the-fly to deal with
unexpectedly high data volumes.
• Concurrency: Is no longer an issue, as each cluster can also be configured to
automatically scale out to satisfy massive numbers of users, then scale back when no
longer needed.
• Tuning and Maintenance: Is no longer an issue, as Snowflake does not support indexes,
and aside from a few well-documented best practices, there is no need to tune the
database. Built for simplicity, there's little requirement for DBA resources.

In terms of the disadvantages, there is not much to write about. Customers on legacy Oracle,
Netezza, Teradata or IBM platforms will need to migrate to Snowflake, and this should be
considered as part of an overall cloud strategy. Otherwise, there are no significant
drawbacks for a data warehouse platform.


Disclaimer: The opinions expressed on this site are entirely my own, and will not necessarily
reflect those of my employer.
