Streamsets & Snowflake: Part 1

TLDR; We recommend waiting; read on to see why, and for a workaround that may be acceptable while the connector is being perfected.
Note that the connector is an enterprise connector and is not covered under the Apache 2.0 license under which Streamsets is released as open source.
The Streamsets Snowflake connector can be installed and configured following the instructions here (https://streamsets.com/documentation/controlhub/latest/help/datacollector/UserGuide/Destinations/Snowflake.html).
Please refer to Streamsets' official documentation on the various ways to install or ‘spin up’ a Streamsets instance. We have opted for the latest (as of testing) Docker image for testing purposes.
Use Case
We will keep the use case simple here and address the loading of a WebSocket data source into Snowflake in a ‘micro-batch’ or near real-time fashion. The data is not expected to arrive in real time, but an SLA of ~1 minute is acceptable.
For simplicity, we have chosen to use a public API. The input data source will be the CoinCap API (https://docs.coincap.io/?version=latest), and we will integrate their trades WebSocket, which streams trades from the Binance exchange. Since this data flows at a rapid pace, it is appropriate for us to test the performance of the connector in different scenarios.
Copy-Based EL
The COPY approach to ingestion is the most basic approach to ingesting data from an external source and placing it in a Snowflake table. On the Streamsets side, this requires configuring:
Connection to Snowflake
Target table in Snowflake
Stage for Snowflake
Streamsets Configuration
The first step is to set up a connection to Snowflake. This requires knowledge of the Account name, a user
name (e.g. name of service account created for this ingestion process) and the password for that account.
Also, since Snowflake makes extensive use of RBAC roles, it may be necessary to specify the Role as a
connection property if the default role for the user is not set to the one required for this ingestion
pipeline.
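As a sketch of the Snowflake-side setup for such a connection, the service user and its default role might be provisioned as follows. This is illustrative only: INGEST_USER, INGEST_ROLE, and the password placeholder are assumed names, not part of the pipeline described here.

```sql
// Illustrative sketch: create a service account for the ingestion pipeline.
use role SECURITYADMIN;
create role if not exists INGEST_ROLE;
create user if not exists INGEST_USER
  password = '<strong-password-here>'
  default_role = INGEST_ROLE;  // sets the default role for new sessions
grant role INGEST_ROLE to user INGEST_USER;
```

Setting default_role on the user means the Role connection property can be left unset unless a different role is required for the session.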
Snowflake uses stages, which currently can be internal (managed by Snowflake) or external: S3 in AWS, Blob Storage in Azure, or GCS in GCP.

We will be using an internal stage for simplicity.
The only required parameter here is the Snowflake Stage Name. The Database and Schema are optional. For consistency and to remove any ambiguity, we would recommend explicitly specifying them here as well.
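For example, the stage can be created fully qualified (using the TEST_DB, TEST, and TEST_STG names from the demo scripts in this article), so the connector configuration is unambiguous regardless of session defaults:

```sql
// Fully qualifying the stage removes any dependence on the
// session's current database and schema.
create or replace stage TEST_DB.TEST.TEST_STG
  file_format = (type = csv);
```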
The final step is to specify the Snowflake target table. Although this table does not need to exist a priori, it
is best to define the schema beforehand to make sure that the data type mappings are correct and error
out when issues arise. To configure this part of the connector, it is necessary to specify:
Virtual Warehouse
Database
Schema
Table
Snowflake Configuration
Configuring Snowflake will require both the creation of the table and the internal stage. We assume below that the warehouse, database, and schema have already been created and are owned (or have usage/create granted) by a role called DB_OWNER.
Buildup — the creation of resource/objects
// Setup environment
use role DB_OWNER;
use database TEST_DB;
use schema TEST;
use warehouse TEST_WH;

// Create objects
create or replace stage TEST_STG
  file_format = (type = csv);

// Create table
create or replace table TEST_TBL
(
  EXCHANGE varchar(64),
  BASE varchar(64),
  QUOTE varchar(64),
  DIRECTION varchar(64),
  PRICE float,
  VOLUME float,
  TIMESTAMP number(38,0),
  PRICEUSD float
);
// Setup environment
use role DB_OWNER;
use database TEST_DB;
use schema TEST;
use warehouse TEST_WH;

// Cleanup objects
drop table if exists TEST_TBL;
drop stage if exists TEST_STG;
Snowpipe-Based EL
The configuration for Snowpipe-based ingestion is very similar to that for the COPY-based ingestion, at least as far as Streamsets is concerned. For Snowflake, the creation (and cleanup) scripts for a Snowpipe-owning role, along with the required grants, must be written.
Streamsets Configuration
For Snowpipe, you would select the ‘Use Snowpipe’ option. But be careful, the ‘Table Auto Create’ must be
deselected — if it isn’t, then an exception is thrown at runtime or during pipeline validation.
Snowflake configuration tab
NOTE: It would be better to use a credential store (Azure Key Vault, HashiCorp Vault, etc.) to store these secrets! For this demo, however, we are keeping things simple.
Snowpipe configuration tab
Snowflake Configuration
Similar to the resource creation in the COPY scenario, we provide a demo script here that shows a possible way to create resources for Snowpipe-based continual ingestion. In it, a role is created and granted the usage rights the Snowpipe needs (as specified in the Snowflake documentation). A role with specific SecurityAdmin-type rights is needed if you prefer not to use the SecurityAdmin role itself to do some of this work. This role is called SPECIAL_POWER_ROLE below.
We note one specific deviation from the canonical ‘grants’ that are required by Snowflake for a pipe. Since
Streamsets creates a temporary file format, the role must also be granted ‘create file format’ on the
schema.
Buildup — the creation of resource/objects
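A possible sketch of such a buildup script follows. SPECIAL_POWER_ROLE is named in the text; the Snowpipe-owning role PIPE_ROLE and the service user INGEST_USER are illustrative assumptions, and the exact grants should be checked against the Snowflake Snowpipe documentation:

```sql
// Sketch only: role and grants for Snowpipe-based ingestion.
use role SPECIAL_POWER_ROLE;
create role if not exists PIPE_ROLE;

grant usage on database TEST_DB to role PIPE_ROLE;
grant usage on schema TEST_DB.TEST to role PIPE_ROLE;
grant insert, select on table TEST_DB.TEST.TEST_TBL to role PIPE_ROLE;
grant read on stage TEST_DB.TEST.TEST_STG to role PIPE_ROLE;
grant create pipe on schema TEST_DB.TEST to role PIPE_ROLE;

// The deviation noted above: Streamsets creates a temporary file
// format, so the role also needs this grant on the schema.
grant create file format on schema TEST_DB.TEST to role PIPE_ROLE;

grant role PIPE_ROLE to user INGEST_USER;  // illustrative service user
```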
Baseline Performance Analysis
To gauge the performance of the connector writing directly to Snowflake, we test each connector for approximately 4 minutes. We did not time this precisely against the wall clock, but the results demonstrate enough to raise some concern. We discuss an alternative solution a bit further down the page.
For the Copy version of the ingestion pipeline, we saw roughly 100 messages ingested.
For Snowpipe, the ingestion was a little better — approximately 66% better throughput.
The disparity in the performance leads us to believe there are some issues with the connector.
Furthermore, seeing this variability raises the question: how much data should actually be coming through the pipeline? See the answer below!
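One way to answer that question from the Snowflake side is to count landed rows per minute. This is a sketch, and it assumes the TIMESTAMP column holds epoch milliseconds as delivered by the CoinCap trades feed:

```sql
// Rows landed per minute (assumes TIMESTAMP is epoch milliseconds).
select date_trunc('minute', to_timestamp(TIMESTAMP / 1000)) as minute_bucket,
       count(*) as trades
from TEST_DB.TEST.TEST_TBL
group by minute_bucket
order by minute_bucket;
```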
Improving Performance
To improve performance, we asked: what would happen if we wrote to a local file instead? Furthermore, if we treat this as a local staging location, can we improve the throughput of the Snowflake connector? Underlying this is the assumption that the connector is processing data in batch sizes that are too small — and that creating micro-batches of data in ‘files’ will improve the throughput.
Streamsets Configuration
For this test, we simplify the Local FS configuration to write all files into a single, non-datestamped folder called /tmp/out/crypto and specify no file suffix. The file names will be ‘sdc-UUID’. We have also set Max Records in File to 100; this can be varied to test different rates of throughput.

The ingestion will use the Directory connector. The critical configuration parameters are a File Name Pattern of ‘sdc-*’ (Glob mode) and a Read Order of ‘Last Modified Timestamp’.
Performance Results
Holy moly! This is very different from what we saw with the direct use of the connector. Even after deduping the ingested data, which carried a volume penalty of only about 8%, the result is spectacularly different.
Performance metrics for the FS to Snowpipe ingestion
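The dedupe mentioned above can be done on the Snowflake side. One hedged sketch, assuming a duplicate means all columns are identical (that dedup key is our assumption), uses QUALIFY:

```sql
// Sketch: keep one copy of each fully duplicated trade row.
create or replace table TEST_TBL_DEDUP as
select *
from TEST_DB.TEST.TEST_TBL
qualify row_number() over (
    partition by EXCHANGE, BASE, QUOTE, DIRECTION,
                 PRICE, VOLUME, TIMESTAMP, PRICEUSD
    order by TIMESTAMP
) = 1;
```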
Error Records
These are errors that occur when the ‘PRICEUSD’ column was missing from the JSON packet. This may be a gap in the Data Drift configuration but has little relevance to this evaluation in general.
Some Additional Insights
This was tested with Streamsets 3.10.1 and 3.13.0 with the Streamsets 1.2.0 and 1.3.0 Snowflake
Enterprise Library. The same issue is present in both cases. But if you think about it, it is not that the
connector is ill-performing; it is missing features that are a necessity when working with Snowflake.
Snowflake requires data to be written to external stages and then loaded from those external stages. This pattern generally desires data to be moved into a stage in batches. The default behavior, which does not seem to be configurable, is to send each ‘message’ batch received to Snowflake as a micro-batch. While there is no inherent problem with micro-batching in Snowflake, and in fact it is done successfully with various data acquisition tools, it does not seem to behave desirably in this case. A clear improvement would be to enable a form of batching control in the Snowflake connector to improve this performance hindrance.
Current Recommendations
At this time we cannot recommend using the Snowflake connector for the ingestion of streaming data, due to a lack of support for micro-batching of incoming data streams. While it may perform well in some situations, it would be advisable to double-check that there is not a better, more performant pipeline configuration, such as the one presented above. We do recognize that writing data to a local FS, especially in a Docker container that can be heavily resource-constrained if hosted in a Kubernetes cluster, was useful in demonstrating shortcomings in the connector's functionality.
Need Help?
If you would like help with Snowflake, Streamsets, or making sure your data solution is performant and purpose-suited to your use cases, then please contact us. Hashmap (https://www.hashmapinc.com/) offers a range of enablement workshops and assessment services, cloud migration services, and consulting service packages as part of our Cloud and Snowflake service offerings — we would be glad to work through your specific requirements.

Feel free to share on other channels, and be sure to keep up with all new content from Hashmap by following our Engineering and Technology Blog (https://medium.com/hashmapinc) and subscribing to our IoT on Tap podcast (https://soundcloud.com/user-646370126-449306361).
John Aven, PhD, is the Lead Regional Technical Expert at Hashmap (http://hashmapinc.com/), providing Data, Cloud, IoT, and AI/ML solutions and consulting expertise across industries with a group of innovative technologists and domain experts accelerating high-value business outcomes for our customers. Be sure to connect with John on LinkedIn (https://www.linkedin.com/in/john-aven-phd-815a8074) and reach out for more perspectives and insight into accelerating your data-driven business outcomes.