Preparing Your Data Files


This topic provides best practices, general guidelines, and important
considerations for preparing your data files for loading.

In this Topic:

- File Sizing Best Practices and Limitations
  - General File Sizing Recommendations
  - Semi-structured Data Size Limitations
  - Parquet Data Size Limitations
- Continuous Data Loads (i.e. Snowpipe) and File Sizing
- Preparing Delimited Text Files
- Semi-structured Data Files and Columnarization
- Numeric Data Guidelines
- Date and Timestamp Data Guidelines

File Sizing Best Practices and Limitations


For best load performance and to avoid size limitations, consider the following
data file sizing guidelines. Note that these recommendations apply to bulk
data loads as well as continuous loading using Snowpipe.

General File Sizing Recommendations


The number of load operations that run in parallel cannot exceed the number
of data files to be loaded. To optimize the number of parallel operations for a
load, we recommend aiming to produce data files roughly 10 MB to 100 MB in
size compressed. Aggregate smaller files to minimize the processing
overhead for each file. Split larger files into a greater number of smaller files to
distribute the load among the servers in an active warehouse. The number of
data files that are processed in parallel is determined by the number and
capacity of servers in a warehouse. We recommend splitting large files by line
to avoid records that span chunks.

If your source database does not allow you to export data files in smaller
chunks, you can use a third-party utility to split large CSV files.

Linux or macOS
The split utility enables you to split a CSV file into multiple smaller files.

Syntax:

split [-a suffix_length] [-b byte_count[k|m]] [-l line_count] [-p pattern] [file [name]]

For more information, type man split in a terminal window.

Example:

split -l 100000 pagecounts-20151201.csv pages

This example splits a file named pagecounts-20151201.csv into smaller files of 100,000 lines each.

Suppose the single large file is 8 GB in size and contains 10 million lines. Splitting it by 100,000 lines produces 100 smaller files (10 million / 100,000 = 100), each roughly 80 MB in size. The split files are named pages<suffix>.

Windows

Windows does not include a native file split utility; however, Windows supports
many third-party tools and scripts that can split large data files.

Semi-structured Data Size Limitations


The VARIANT data type imposes a 16 MB (compressed) size limit on
individual rows.

In general, JSON and Avro data sets are a simple concatenation of multiple
documents. The JSON or Avro output from some software is composed of a
single huge array containing multiple records. There is no need to separate
the documents with line breaks or commas, though both are supported.

Instead, we recommend enabling the STRIP_OUTER_ARRAY file format option for the COPY INTO <table> command to remove the outer array structure and load the records into separate table rows:

copy into <table>
from @~/<file>.json
file_format = (type = 'JSON' strip_outer_array = true);

Parquet Data Size Limitations


Currently, data loads of large Parquet files (e.g. greater than 3 GB) could time
out. Split large files into files 1 GB in size (or smaller) for loading.

Continuous Data Loads (i.e. Snowpipe) and File Sizing

Snowpipe is designed to load new data typically within a minute after a file
notification is sent; however, loading can take significantly longer for really
large files or in cases where an unusual amount of compute resources is
necessary to decompress, decrypt, and transform the new data.
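
For illustration only, a minimal sketch of a pipe that performs such loads; the pipe, stage, and table names (mypipe, mystage, mytable) and the JSON file format are assumptions, not part of this topic:

create pipe mypipe auto_ingest = true as  -- auto_ingest relies on cloud event notifications
  copy into mytable
  from @mystage
  file_format = (type = 'JSON');

When a notification arrives for a new file on the stage, Snowpipe queues the file and loads it using the COPY statement defined in the pipe.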

In addition to resource consumption, an overhead to manage files in the internal load queue is included in the utilization costs charged for Snowpipe. This overhead increases in relation to the number of files queued for loading. Snowpipe charges 0.06 credits per 1000 files queued.

For the most efficient and cost-effective load experience with Snowpipe, we
recommend following the file sizing recommendations in File Sizing Best
Practices and Limitations (in this topic). If it takes longer than one minute to
accumulate MBs of data in your source application, consider creating a new
(potentially smaller) data file once per minute. This approach typically leads to
a good balance between cost (i.e. resources spent on Snowpipe queue
management and the actual load) and performance (i.e. load latency).

Creating smaller data files and staging them in cloud storage more often than
once per minute has the following disadvantages:

- A reduction in latency between staging and loading the data cannot be guaranteed.
- An overhead to manage files in the internal load queue is included in the utilization costs charged for Snowpipe. This overhead increases in relation to the number of files queued for loading.

Various tools can aggregate and batch data files. One convenient option is
Amazon Kinesis Firehose. Firehose allows defining both the desired file size,
called the buffer size, and the wait interval after which a new file is sent (to
cloud storage in this case), called the buffer interval. For more information,
see the Kinesis Firehose documentation.

If your source application typically accumulates enough data within a minute to populate files larger than the recommended maximum for optimal parallel processing, you could decrease the buffer size to trigger delivery of smaller files. Keeping the buffer interval setting at 60 seconds (the minimum value) helps avoid creating too many files or increasing latency.

Preparing Delimited Text Files


Consider the following guidelines when preparing your delimited text (CSV)
files for loading:

- UTF-8 is the default character set; however, additional encodings are supported. Use the ENCODING file format option to specify the character set for the data files. For more information, see CREATE FILE FORMAT. (A sample file format definition follows this list.)
- Fields that contain delimiter characters should be enclosed in quotes (single or double). If the data contains single or double quotes, then those quotes must be escaped.
- Carriage returns are commonly introduced on Windows systems in conjunction with a line feed character to mark the end of a line (\r \n). Fields that contain carriage returns should also be enclosed in quotes (single or double).
- The number of columns in each row should be consistent.
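
The following is a minimal sketch of a CSV file format reflecting these guidelines; the format name (my_csv_format), the encoding, and the quoting character are illustrative assumptions:

create or replace file format my_csv_format  -- hypothetical name
  type = 'CSV'
  encoding = 'ISO-8859-1'                    -- only needed when the files are not UTF-8
  field_optionally_enclosed_by = '"'         -- quote fields that contain delimiters or carriage returns
  skip_header = 1;                           -- skip one header row

The format can then be referenced from a COPY statement with FILE_FORMAT = (FORMAT_NAME = 'my_csv_format').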

Semi-structured Data Files and Columnarization

When semi-structured data is inserted into a VARIANT column, Snowflake
extracts as much of the data as possible to a columnar form, based on certain
rules. The rest is stored as a single column in a parsed semi-structured
structure. Currently, elements that have the following characteristics
are not extracted into a column:

- Elements that contain even a single “null” value are not extracted into a column. Note that this applies to elements with “null” values and not to elements with missing values, which are represented in columnar form.
  This rule ensures that information is not lost, i.e., the difference between VARIANT “null” values and SQL NULL values is not obfuscated.
- Elements that contain multiple data types. For example, the foo element in one row contains a number:
  {"foo":1}
  The same element in another row contains a string:
  {"foo":"1"}

When a semi-structured element is queried:

- If the element was extracted into a column, Snowflake’s execution engine (which is columnar) scans only the extracted column.
- If the element was not extracted into a column, the execution engine must scan the entire JSON structure, and then for each row traverse the structure to output values, impacting performance (see the example query after this list).
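
For illustration, consider a hypothetical table raw_events with a VARIANT column named src (both names are assumptions). A query such as the following scans only the extracted column when device_id was columnarized, but must traverse the full structure in every row when it was not:

select src:device_id::string as device_id,
       src:temperature::number as temperature
from raw_events
where src:device_id::string = 'sensor-42';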

To avoid this performance impact:

- Extract semi-structured data elements containing “null” values into relational columns before loading them.
  Alternatively, if the “null” values in your files indicate missing values and have no other special meaning, we recommend setting the file format option STRIP_NULL_VALUES to TRUE when loading the semi-structured data files. This option removes object elements or array elements containing “null” values (see the sample COPY statement after this list).
- Ensure each unique element stores values of a single native data type (string or number).
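
A minimal sketch of such a load, mirroring the earlier COPY example (the table and file names are placeholders):

copy into <table>
from @~/<file>.json
file_format = (type = 'JSON' strip_null_values = true);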

Numeric Data Guidelines


Related Topics
- Numeric Data Types

- Avoid embedded characters, such as commas (e.g., 123,456).
- If a number includes a fractional component, it should be separated from the whole number portion by a decimal point (e.g., 123456.789).
- Oracle only. The Oracle NUMBER or NUMERIC types allow for arbitrary scale, meaning they accept values with decimal components even if the data type was not defined with a precision or scale. In Snowflake, however, columns designed for values with decimal components must be defined with a scale to preserve the decimal portion (see the sketch after this list).
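
For example, a minimal sketch of a target column definition for such data; the table name, column name, precision, and scale are illustrative assumptions:

create or replace table measurements (
  reading number(38, 6)  -- a nonzero scale preserves the decimal portion of loaded values
);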

Date and Timestamp Data Guidelines


Related Topics
- Date & Time Data Types

- Date, time, and timestamp data should be formatted based on the following components (a sample file format using these components follows this list):

  Format                 Description
  YYYY                   Four-digit year.
  YY                     Two-digit year, controlled by the TWO_DIGIT_CENTURY_START session parameter, e.g. when set to 1980, values of 79 and 80 parsed as 2079 and 1980 respectively.
  MM                     Two-digit month (01=January, etc.).
  MON                    Full or abbreviated month name.
  DD                     Two-digit day of month (01 through 31).
  DY                     Abbreviated day of week.
  HH24                   Two digits for hour (00 through 23); am/pm not allowed.
  HH12                   Two digits for hour (01 through 12); am/pm allowed.
  AM, PM                 Ante meridiem (am) / post meridiem (pm); for use with HH12.
  MI                     Two digits for minute (00 through 59).
  SS                     Two digits for second (00 through 59).
  FF                     Fractional seconds with precision 0 (seconds) to 9 (nanoseconds), e.g. FF, FF0, FF3, FF9. Specifying FF is equivalent to FF6 (microseconds).
  TZH:TZM, TZHTZM, TZH   Time zone hour and minute, offset from UTC. Can be prefixed by +/- for sign.
- Oracle only. The Oracle DATE data type can contain date or timestamp information. If your Oracle database includes DATE columns that also store time-related information, map these columns to a TIMESTAMP data type in Snowflake rather than DATE.
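
For example, a minimal sketch of a file format built from these components; the format name and the specific patterns are illustrative assumptions:

create or replace file format my_ts_format  -- hypothetical name
  type = 'CSV'
  date_format = 'YYYY-MM-DD'
  timestamp_format = 'YYYY-MM-DD HH24:MI:SS.FF3 TZH:TZM';
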
Note

Snowflake checks temporal data values at load time. Invalid date, time, and
timestamp values (e.g., 0000-00-00) produce an error.
