Why was the company called Snowflake?

Despite a long tradition of technology companies having non-tech names (for example Apple, Google and Amazon), Snowflake was not named by a marketing team. According to the founders, it was named because of their shared love of snow and skiing.
I was lucky enough to attend a meeting with the founders, where the French-born founder Thierry Cruanes explained, in a full French accent, how difficult it was to pronounce the name of his previous company, Oracle. At least now, he joked, people could understand “Snowflake”.

Snowflake on AWS

Snowflake was first available on Amazon Web Services (AWS), and is a software-as-a-service platform to load, analyse and report on massive data volumes. Unlike traditional on-premises solutions, which require hardware to be deployed (potentially costing millions), Snowflake is deployed in the cloud within minutes and is charged by the second on a pay-as-you-use model.
It is possible to register and create an account within minutes, which includes $400 of free credit: enough to store a terabyte of data and run a small data warehouse for nearly two weeks, on a system that will support a small team of developers.

Snowflake on Azure

In July 2018, Snowflake announced its launch on the Microsoft Azure cloud platform. Built on essentially the same code base as the AWS version, this gives customers a choice of cloud platform, which is a significant advantage to large corporates as it enables a multi-cloud deployment strategy.

Does Snowflake run on Google Cloud?

Absolutely. You can now run Snowflake across all three major cloud platforms, and indeed seamlessly share or replicate data across platforms. There is also a huge number of 3rd-party tools available to load data from Google Cloud Platform into Snowflake on AWS or Azure.

What is the Snowflake Data Warehouse?


Founded in 2012 by three data warehousing experts, Snowflake is a cloud-based data warehouse. Just six years later, the company raised a massive $450m venture capital investment, which valued the company at $3.5 billion. But what is Snowflake, and why is this data warehouse built entirely for the cloud taking the analytics world by storm?
Although not intended as a Snowflake data warehouse tutorial, this article will explain what Snowflake is, which platforms Snowflake supports, and the key aspects of this ground-breaking technology.

What makes Snowflake Data Warehouse unique?

There are many incredible features built into Snowflake, but the most remarkable is the ability to spin up an unlimited number of virtual warehouses (each effectively an independent MPP cluster). This means users can run an infinite number of independent workloads against the same data without any risk of contention, as illustrated in the diagram below.

In addition, each warehouse can be resized within milliseconds, from a single-node extra-small cluster to a massive 128-node monster. This means users don’t have to put up with poor performance, as the machine size can be adjusted throughout the day to match the workload. In one benchmark test, I reduced the time to process 1.3 terabytes of data from 5 hours down to under 3 minutes.
Finally, in addition to scaling up for larger data volumes, it’s also possible to automatically scale out to support a massive number of users. The diagram below illustrates how the Snowflake multi-cluster feature automatically scales out and then back in during the day, and the user is only charged for the time the clusters are actually running.
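
For illustration, resizing a warehouse or enabling multi-cluster scale-out is a single statement. The sketch below assumes the sf_tuts_wh warehouse created later in this article; the reporting_wh name and its cluster counts are hypothetical, and multi-cluster warehouses require the Enterprise edition:

-- resize an existing warehouse on the fly
alter warehouse sf_tuts_wh set warehouse_size = 'X-LARGE';

-- a hypothetical multi-cluster warehouse that scales out between 1 and 4 clusters
create or replace warehouse reporting_wh with
  warehouse_size = 'MEDIUM'
  min_cluster_count = 1
  max_cluster_count = 4
  scaling_policy = 'STANDARD';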
What are the layers of the Snowflake service?

The diagram below illustrates the layers in the Snowflake service:
- The Service Layer: Provides connectivity to the database and handles concurrency, transaction management and metadata.
- The Compute Layer: Hosts a potentially unlimited number of virtual warehouses (compute clusters) on which SQL statements are executed.
- The Storage Layer: Hosts a potentially infinite-size data pool.

What SQL does Snowflake use?


Snowflake supports a standard set of SQL, a subset of the ANSI 1999 and 2003 standards. This means most SQL statements which currently execute against Teradata, Netezza, Oracle or Microsoft SQL Server will also execute on Snowflake, often with no changes needed. Indeed, Snowflake includes a number of extensions to ensure existing SQL can be quickly migrated.

Is Snowflake a Data Lake?


The Data Lake architecture became popular as a method of storing massive data volumes in their raw form, rather than transforming and loading the data into a data warehouse, which inevitably leads to selectivity and consequent data loss. This architecture was traditionally deployed on Hadoop platforms, as it often includes semi-structured and unstructured data which were challenging to handle on traditional relational platforms.
Unlike legacy data warehouses, Snowflake supports both structured and semi-structured data including JSON, AVRO and Parquet, and these can be directly queried using SQL. Unlike Hadoop, Snowflake independently scales compute and storage resources, and is therefore a far more cost-effective platform for a data lake.
As a result, many customers moving to a cloud-based deployment are implementing their data lake directly in Snowflake, as it provides a single platform to manage, transform and analyse massive data volumes. The ability to seamlessly combine JSON and structured data in a single query is a compelling advantage of Snowflake, and avoids operating separate platforms for the Data Lake and the Data Warehouse.
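
As a minimal sketch of that single-platform query capability (the raw_events table and its JSON attribute names here are hypothetical), JSON loaded into a VARIANT column can be queried directly with path notation and casts:

-- hypothetical table holding raw JSON documents
create or replace table raw_events (v variant);

-- query JSON attributes directly, alongside ordinary SQL expressions
select v:user.name::string  as user_name,
       v:event_type::string as event_type,
       count(*)             as events
from raw_events
group by 1, 2;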
In his excellent article, Tripp Smith explains the benefits of the EPP Snowflake architecture, which can deliver savings of up to 300:1 on storage compared to Hadoop or MPP platforms.

Getting Started with Snowflake


Step 1. Log into SnowSQL
1. Open a terminal window.

2. Start SnowSQL at the command prompt:

$ snowsql -a <account_name> -u <user_name>


Where:

- <account_name> is the name assigned to your account by Snowflake. In the hostname you received from Snowflake (after your account was provisioned), your account name is the full/entire string to the left of snowflakecomputing.com.
- <user_name> is the login name for your Snowflake user.
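
For example, with a hypothetical account name xy12345 and user jsmith, the command would be:

$ snowsql -a xy12345 -u jsmith

SnowSQL then prompts for the user’s password before starting the session.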

Step 2. Create Snowflake Objects


Creating a Database
Create the  sf_tuts  database using the CREATE DATABASE command:
create or replace database sf_tuts;
Note that you do not need to create a schema in the database because each
database created in Snowflake contains a default schema named  public .

Also, note that the database and schema you just created are now in use for your
current session. This information is displayed in your SnowSQL command
prompt, but can also be viewed using the following context functions:

select current_database(), current_schema();

Creating a Table
Create a table named  emp_basic  in  sf_tuts.public  using the CREATE
TABLE command:
create or replace table emp_basic (
  first_name string,
  last_name string,
  email string,
  streetaddress string,
  city string,
  start_date date
);

Note that the number of columns in the table, their positions, and their data types
correspond to the fields in the sample CSV data files that you will be staging in
the next step in this tutorial.
Creating a Virtual Warehouse
Create an X-Small warehouse named  sf_tuts_wh  using the CREATE
WAREHOUSE command:
create or replace warehouse sf_tuts_wh with
  warehouse_size = 'X-SMALL'
  auto_suspend = 180
  auto_resume = true
  initially_suspended = true;

Note that the warehouse is not started initially, but it is set to auto-resume, so it
will automatically start running when you execute your first SQL command that
requires compute resources.

Also, note that the warehouse is now in use for your current session. This information is displayed in your SnowSQL command prompt, but can also be viewed using the following context function:

select current_warehouse();

Step 4. Copy Data into the Target Table


Execute COPY INTO <table> to load your staged data into the target table.

Note that this command requires an active, running warehouse, which you created earlier in this tutorial. If you don’t have access to a warehouse, you will need to create one now.

copy into emp_basic
  from @%emp_basic
  file_format = (type = csv field_optionally_enclosed_by='"')
  pattern = '.*employees0[1-5].csv.gz'
  on_error = 'skip_file';

Let’s look more closely at this command:

- The FROM clause identifies the internal stage location.
- FILE_FORMAT specifies the file type as CSV, and specifies the double-quote character (") as the character used to enclose strings. Snowflake supports diverse file types and options. These are described in CREATE FILE FORMAT. The example COPY statement accepts all other default file format options.
- PATTERN applies pattern matching to load data from all files that match the regular expression .*employees0[1-5].csv.gz.
- ON_ERROR specifies what to do when the COPY command encounters errors in the files. By default, the command stops loading data when the first error is encountered; however, we’ve instructed it to skip any file containing an error and move on to loading the next file. Note that this is just for illustration purposes; none of the files in this tutorial contain errors.
The COPY command also provides an option for validating files before you load
them. See the COPY INTO <table> topic and the other data loading tutorials for
additional error checking and validation instructions.
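
As a sketch of that validation option, adding VALIDATION_MODE to the COPY statement above checks the staged files and returns any errors without loading a single row:

copy into emp_basic
  from @%emp_basic
  file_format = (type = csv field_optionally_enclosed_by='"')
  pattern = '.*employees0[1-5].csv.gz'
  validation_mode = 'RETURN_ERRORS';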

Step 5. Query the Loaded Data


You can query the data loaded in the emp_basic table using standard SQL and any supported functions and operators.

You can also manipulate the data, such as updating the loaded data or inserting
more data, using standard DML commands.
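
For example, a standard UPDATE works exactly as it would on any relational database (the corrected value here is hypothetical):

-- hypothetical fix: correct the city recorded for one employee
update emp_basic
  set city = 'London'
  where email = 'gbassfordo@sf_tuts.co.uk';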

Query All Data


Return all rows and columns from the table:

select * from emp_basic;

Insert Additional Rows of Data


In addition to loading data from staged files into a table, you can insert rows
directly into a table using the INSERT DML command.

For example, to insert two additional rows into the table:

insert into emp_basic values
  ('Clementine','Adamou','cadamou@sf_tuts.com','10510 Sachs Road','Klenak','2017-9-22'),
  ('Marlowe','De Anesy','madamouc@sf_tuts.co.uk','36768 Northfield Plaza','Fangshan','2017-1-26');

Query Rows Based on Email Address
Return a list of email addresses with United Kingdom domain names using
the LIKE function:
select email from emp_basic where email like '%.uk';

+--------------------------+
| EMAIL                    |
|--------------------------|
| gbassfordo@sf_tuts.co.uk |
| rtalmadgej@sf_tuts.co.uk |
| madamouc@sf_tuts.co.uk   |
+--------------------------+

Query Rows Based on Start Date


Add 90 days to employee start dates using the DATEADD function to calculate when certain employee benefits might start. Filter the list to employees whose start date is on or before January 1, 2017:

select first_name, last_name, dateadd('day', 90, start_date)
from emp_basic
where start_date <= '2017-01-01';

+------------+-----------+------------------------------+
| FIRST_NAME | LAST_NAME | DATEADD('DAY',90,START_DATE) |
|------------+-----------+------------------------------|
| Granger    | Bassford  | 2017-03-30                   |
| Catherin   | Devereu   | 2017-03-17                   |
| Cesar      | Hovie     | 2017-03-21                   |
| Wallis     | Sizey     | 2017-03-30                   |
+------------+-----------+------------------------------+

Tutorial Clean Up (Optional)


Execute the following DROP <object> statements to return your system to its
state before you began the tutorial:
drop database if exists sf_tuts;

drop warehouse if exists sf_tuts_wh;

Lifecycle Diagram
All user data in Snowflake is logically represented as tables that can be queried
and modified through standard SQL interfaces. Each table belongs to a schema
which in turn belongs to a database.

Organizing Data
You can organize your data into databases, schemas, and tables. Snowflake does not limit the number of databases you can create, the number of schemas you can create within a database, or the number of tables you can create in a schema.
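
A minimal sketch of this hierarchy, using hypothetical names:

-- database -> schema -> table
create or replace database analytics;
create schema analytics.staging;
create table analytics.staging.events (
  event_id number,
  payload  variant
);
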
Storing Data
You can insert data directly into tables. In addition, Snowflake provides DML for
loading data into Snowflake tables from external, formatted files.

Querying Data
Once data is stored in a table, you can issue SELECT statements to query the
data.
Working with Data
Once data is stored in a table, all standard DML operations can be performed on
the data. In addition, Snowflake supports DDL actions such as cloning entire
databases, schemas, and tables.
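
Cloning is a single DDL statement. For example, a zero-copy clone of the tutorial table or of the whole database (the _dev names are hypothetical):

-- zero-copy clones: no data is physically duplicated at creation time
create table emp_basic_dev clone emp_basic;
create database sf_tuts_dev clone sf_tuts;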

Removing Data
In addition to using the DML command, DELETE, to remove data from a table,
you can truncate or drop an entire table. You can also drop entire schemas and
databases.
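
Each granularity of removal maps to a single command; a short sketch against the tutorial objects:

delete from emp_basic where start_date < '2017-01-01';  -- remove selected rows
truncate table emp_basic;                               -- remove all rows, keep the table
drop table if exists emp_basic;                         -- remove the table itself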

DDL Commands
Data Definition Language (DDL) commands are used to create, manipulate, and
modify objects in Snowflake, such as users, virtual warehouses, databases,
schemas, tables, views, columns, functions, and stored procedures.

They are also used to perform many account-level and session operations, such
as setting parameters, initializing variables, and initiating transactions.

The following commands serve as the base for all DDL commands:

- ALTER <object>
- COMMENT
- CREATE <object>
- DESCRIBE <object>
- DROP <object>
- SHOW <objects>
- USE <object>

Each command takes an object type and identifier. The remaining parameters
and options that can be specified for the command are object-specific.
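
The pattern is uniform across object types. A few illustrative examples using objects from the tutorial above:

create database sf_tuts;
alter warehouse sf_tuts_wh set auto_suspend = 300;
describe table emp_basic;
show tables in schema sf_tuts.public;
drop database if exists sf_tuts;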

Account Parameters & Functions

ALTER ACCOUNT: For setting parameters at the account level; can only be performed by users with the ACCOUNTADMIN role.

SHOW FUNCTIONS: Displays system-defined functions, as well as any user-defined functions.

SHOW PARAMETERS: For viewing parameter settings for the account.

Managed Accounts

CREATE MANAGED ACCOUNT: Currently used to create reader accounts for providers who wish to share data with non-Snowflake customers.

DROP MANAGED ACCOUNT

SHOW MANAGED ACCOUNTS

Global Accounts, Database Replication, and Database Failover

SHOW GLOBAL ACCOUNTS: Deprecated. Use SHOW REPLICATION ACCOUNTS instead.

SHOW REPLICATION ACCOUNTS

SHOW REPLICATION DATABASES

SHOW REGIONS

Session Parameters

ALTER SESSION: For setting parameters within a session; can be performed by any user.

SHOW PARAMETERS: For viewing parameter settings for the session (or account); can also be used to view parameter settings for a specified object.

Session Context

USE ROLE: Specifies the user role to use in the session.

USE WAREHOUSE: Specifies the virtual warehouse to use in the session.

USE DATABASE: Specifies the database to use in the session.

USE SCHEMA: Specifies the schema to use in the session (the specified schema must be in the current database for the session).
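
Together these commands establish the working context for a session; for example (assuming the SYSADMIN role and the tutorial objects exist):

use role sysadmin;
use warehouse sf_tuts_wh;
use database sf_tuts;
use schema public;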

Session Transactions, SQL Variables, and Queries

BEGIN: For use with multi-statement transactions.

COMMIT: For use with multi-statement transactions.

DESCRIBE RESULT: Describes the columns in the results from a specified query (must have been executed within the last 24 hours).

ROLLBACK: For use with multi-statement transactions.

SET: For defining SQL variables in the session.

SHOW LOCKS: For use with multi-statement transactions.

SHOW TRANSACTIONS

SHOW VARIABLES: For showing SQL variables in the session.

UNSET: For dropping SQL variables in the session.
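
A short sketch of SQL variables and a multi-statement transaction against the tutorial table (the variable name is hypothetical):

-- define a session variable and reference it with $
set min_start = '2017-01-01';
select * from emp_basic where start_date >= $min_start;

-- an explicit multi-statement transaction
begin;
update emp_basic set city = 'London' where start_date >= $min_start;
commit;

show variables;
unset min_start;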
