www.sybase.com
TABLE OF CONTENTS
Loading Overview
Bulk Loading Methods
LOAD TABLE
INSERT…LOCATION
INSERT…SELECT using Proxy Tables
Incremental Loading Methods
ETL – Extract, Transform and Load
Sybase ETL
Stream Loading Methods
RAP/CEP
Row-based INSERT…VALUES
Summary
1. LOADING OVERVIEW
There are many choices for loading data into Sybase IQ. The following picture depicts the choices:
[Figure: load paths into Sybase IQ. Data sources: ASCII, binary, and BCP files; Sybase and non-Sybase databases;
Replication Server (direct or staged); streams. Load methods: LOAD TABLE (parallel or serial mode, with optional
client-side load), INSERT…LOCATION, INSERT…SELECT, ETL, RTL, and RAP/CEP.]
Reading the image from bottom to top, we start with the data sources and move into the various load methods.
The load methods are placed above valid data sources. For example, the “INSERT…LOCATION” method is designed to
load data directly from databases. ETL can load data from either files or databases. The color of the box indicates the
relative speeds of the loading method—red is suitable for smaller tables, yellow is faster, green is very fast, and blue is
the fastest.
RAP/CEP from streams was designed for the low-latency requirements of financial services markets. It is the fastest
of the options, but works in a narrower context than the other methods. For general-purpose loading, your best option
for performance is the “LOAD TABLE” bulk loader, which can handle thousands to millions of rows per second. You will
want to run it in parallel mode, rather than serial mode, for optimum results.
ETL is listed in yellow, because an ETL process generally means that more data processing is done between the raw
data and the database engine. Although “INSERT…LOCATION” is an excellent method for loading IQ, it is not as fast as
“LOAD TABLE” due to the inherent slowness of the data arrival rate through the Open Client connection.
For environments with load requirements higher than a few hundred rows per second, the red boxes
should be avoided. “INSERT…VALUES” and direct trickle loads from Sybase Replication Server are row-based operations.
These methods do not take advantage of Sybase IQ’s loading engine, which has been designed for large, bulk loads.
Replication Server combined with staging, and the Replication Server Real-Time Loading option (both discussed later),
let you make incremental, near-real-time updates to Sybase IQ with good performance.
2. BULK LOADING METHODS
2.1 LOAD TABLE
The LOAD TABLE SQL statement imports data from an ASCII or binary formatted file into an existing database table.
An ASCII source file may be in fixed width format (each field occupies a fixed length of bytes) or delimited format
(each field is terminated by a specified delimiter). The LOAD TABLE statement loads both the data and all indexes that
have been created, and does not require any further table reorganizations or index rebuilds. The statement syntax
looks like this:
The various LOAD TABLE options specify the formatting of the input file, how to handle errors, and how to notify the
user of loading progress.
Here is an example of a LOAD TABLE statement that loads ASCII, pipe delimited data from a file on a Windows platform:
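A statement of that kind can be sketched as follows. This is a minimal illustration in Python that assembles the statement text so the delimiters stay explicit. The helper `build_load_table`, the `customer` table, its columns, and the file path are all hypothetical, and the clause spelling and order (`QUOTES OFF`, `ESCAPES OFF`, `FORMAT ASCII`, `DELIMITED BY`, `ROW DELIMITED BY`) are assumptions that should be checked against the Sybase IQ reference manual for your version.

```python
def build_load_table(table, columns, files, col_delim="|", row_delim=r"\n"):
    """Return LOAD TABLE text for delimited ASCII input (hypothetical helper)."""
    if not columns or not files:
        raise ValueError("at least one column and one file are required")
    column_list = ", ".join(columns)
    # Several input files can be named in one statement, separated by commas.
    file_list = ", ".join(f"'{path}'" for path in files)
    return (
        f"LOAD TABLE {table} ({column_list})\n"
        f"FROM {file_list}\n"
        "QUOTES OFF\n"
        "ESCAPES OFF\n"
        "FORMAT ASCII\n"
        f"DELIMITED BY '{col_delim}'\n"
        f"ROW DELIMITED BY '{row_delim}'"
    )


# Hypothetical table, columns, and Windows path, used only for illustration.
statement = build_load_table(
    "customer",
    ["cust_id", "cust_name", "city"],
    ["C:/loads/customer.txt"],
)
print(statement)
```

Passing more than one path in `files` produces a single statement that names every file, which matches the one-command, many-files pattern the paper recommends over one command per file.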
Loading involves three major steps:
1. Reading the data out of the file and placing it into data structures in memory
2. Building indexes from the data structures in memory
a. For FP, LF, HNG, CMP, DATE, DTTM and TIME indexes, data is inserted directly into the index
b. For HG, WD and text indexes, data is sorted during a first pass, and then inserted during a second pass
3. Writing the data and indexes into the database
The format of the source data file for the LOAD TABLE command may be one of the following:
• ASCII
• BCP
• Binary
ASCII files may be created by any means available to the user. BCP is the format generated by Adaptive Server’s BCP
bulk copy utility. Binary files can be created using the Sybase IQ data extraction facility, or via a user defined program
that follows the binary data format as outlined in the IQ manuals. The data extraction facility lets you redirect the
output of a SELECT statement from the standard interface to one or more files or named pipes. The format of
the output file is specified by setting various database “TEMP EXTRACT” options prior to executing the SELECT.
For ASCII format, the file may contain either fixed-length fields or variable-length fields terminated by a delimiter.
Binary files contain only fixed-length fields.
LOAD TABLE can operate in either serial or parallel mode, with parallel mode being significantly faster. During
serial mode loading, IQ will allocate a single thread to reading the source data file from disk into memory. Parallel
mode allocates multiple threads for parallel reading of the data from disk into memory. Once in memory, all data is
processed in parallel. The format of the input data file and LOAD TABLE statement dictates whether the load will be
serial or parallel:
The above chart references load options for delimited files in ASCII format. These options define the row and
column delimiters for the source data:
• ROW DELIMITED BY: specifies the character string that marks the end of a record. Each record in the input file
corresponds to a row to be created in Sybase IQ.
• DELIMITED BY: specifies the character string that marks the end of each field in a record. Each field corresponds to
a column specified in the LOAD TABLE statement.
• Either the row or column delimiter can be any valid ASCII character, from the standard comma or tab to
unprintable characters like ^K or ^G.
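To make the two delimiter roles concrete, the Python sketch below splits raw file text into records and fields. It is an illustration of the idea only, not how the IQ loader parses input; the function name `split_records` and the sample data are hypothetical, and the column delimiter is treated as a separator between fields for simplicity.

```python
def split_records(text, col_delim="|", row_delim="\n"):
    """Split raw file text into rows of fields (illustration only)."""
    rows = []
    for record in text.split(row_delim):
        if record == "":
            continue  # skip the empty piece after the final row delimiter
        rows.append(record.split(col_delim))
    return rows


# Pipe between fields, newline at the end of each record.
sample = "1|Ann|Boston\n2|Raj|Dublin\n"
rows = split_records(sample)
print(rows)

# The same data with unprintable delimiters:
# ^G (0x07) between fields and ^K (0x0b) at the end of each record.
control = "1\x07Ann\x07Boston\x0b2\x07Raj\x07Dublin\x0b"
control_rows = split_records(control, col_delim="\x07", row_delim="\x0b")
print(control_rows == rows)
```

Unprintable delimiters are useful when the data itself may contain commas, tabs, or pipes.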
Note that even if you have an input data file that should dictate a parallel load, the load might not be done in
parallel for the following reasons:
• You are doing a partial width insert (only a subset of the columns is being loaded into a table)
• You are using the SKIP option to skip a set of rows in the input file, or the LIMIT option to stop reading the input
file after a certain number of rows
• You do not have enough threads configured for IQ (-iqmt server option)
• You do not have enough threads configured for a user (Max_IQ_Threads_Per_Connection and
Max_IQ_Threads_Per_Team options)
You can tell whether a load was done in serial or parallel by looking at the Sybase IQ message file. There will be a
message indicating that “portions of this load may be single threaded” if the load was done in serial.
If you are planning to load a table from multiple files, it is faster to load all files with one LOAD TABLE command,
rather than a separate LOAD TABLE command for each file. LOAD TABLE gives you the option of specifying multiple
input files separated by commas.
LOAD TABLE USING CLIENT FILE supports all load options, including LOB support. It is a true bulk loader, and its
performance approximates that of a server-side LOAD TABLE plus the time needed to transfer the source data over the network.
In a client side load, the client application opens the file and then sends data packets across the network. Each
packet contains a portion of the file. The packets are consumed by the server in memory without recreating the file on
the server side.
Client side loading avoids cluttering up the file system on which the server is running. Furthermore, the average
user may not have any privileges to write a file on the server file system.
To improve security of client side loading, use Transport Layer Security (TLS) to encrypt data across the network.
Additionally, the database administrator can control or disable client side loading with various layered security
mechanisms that are available.
2.1.2 Advantages
• When performed in parallel mode, it is the fastest data load method for IQ (thousands to millions of rows
per second)
• It is very easy to implement with as little as one SQL statement per IQ table
• It can be scheduled at any interval to meet the business needs
• It can capture rejected data and handle it after good data has been loaded
• Data may reside on either the server or the client (with the “client side” option)
2.1.3 Disadvantages
• It supports only new data (changes can be handled via work tables and deletes)
• It can be used only with data files, not external applications or data streams
• Software must be developed to capture source data and write it to files or pipes for loading
2.2 INSERT…LOCATION
INSERT…LOCATION allows Sybase IQ to reach into remote Sybase data sources or non-Sybase data sources (via
Enterprise Connect Data Access, or ECDA), execute a query, and load the result set into an IQ table. The syntax is
defined like this:
Here is an example (‘detroit’ is the IQ server name, and ‘iqdb’ is the IQ database name):
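Using the server and database names mentioned above, a statement of this shape can be sketched in Python. The `lineitem` table, its columns, the helper `build_insert_location`, and the brace-delimited form of the remote query are assumptions for illustration; the exact grammar should be confirmed in the Sybase IQ reference manual.

```python
def build_insert_location(target_table, columns, server, database, remote_query):
    """Return INSERT...LOCATION text (hypothetical helper, assumed grammar)."""
    column_list = ", ".join(columns)
    return (
        f"INSERT INTO {target_table} ({column_list})\n"
        # The LOCATION string names the remote server and database.
        f"LOCATION '{server}.{database}'\n"
        # The query inside the braces runs on the remote side.
        f"{{ {remote_query} }}"
    )


insert_statement = build_insert_location(
    "lineitem",
    ["l_shipdate", "l_orderkey"],
    "detroit",
    "iqdb",
    "SELECT l_shipdate, l_orderkey FROM lineitem",
)
print(insert_statement)
```

The remote query is the place to restrict rows and columns, so only the data actually needed crosses the network.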
With non-Sybase source databases, you need to access the database via the Sybase ECDA product:
[Figure: Sybase IQ and Sybase ASA reach remote servers (Oracle, MS SQL Server, DB2 UDB, ODBC, and mainframe
sources) through Sybase EnterpriseConnect.]
2.2.1 Advantages
• It is very easy to implement with as little as one SQL statement per IQ table
• It can be scheduled at any interval to meet the business needs
• It is only as slow as the source system and network transmission
2.2.2 Disadvantages
• You must know how to detect inserts, updates, and deletes at the source system
• Inserts and updates must be converted to deletes followed by inserts for best performance
• INSERT…LOCATION can be used only with databases, not external applications or data streams
2.3 INSERT…SELECT using Proxy Tables
Sybase IQ includes SQL Anywhere as a component of its architecture and uses it for some of its operations. One of
the features of SQL Anywhere is the ability to access data in external sources—databases, spreadsheets and ODBC
data sources—as though that data were local. This is accomplished through Component Integration Services (CIS).
To have remote tables appear as local tables to the client, you create local proxy tables that map to the remote data.
(You will need to install the ECDA gateway to access non-Sybase databases.) To create a proxy table:
1. Define the type and location of the server where the remote data is located.
2. Map the local user login information to the remote server user login information.
3. Create the proxy table definition. This includes the server where the remote table is located, the database name,
table owner name, table and column names.
Once you have defined a proxy table, you can then use an INSERT statement with a SELECT clause to insert data
from the remote database into a permanent table in your Sybase IQ database.
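The INSERT with a SELECT clause pattern can be tried without an IQ server. In the Python sketch below, SQLite stands in for both sides: a second in-memory database attached under the name `remote` plays the role of the remote source. This is an analogy only, not the CIS proxy-table mechanism, and the table and column names are hypothetical. In IQ the remote table would first be exposed through the three proxy-table steps listed above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Stand-in for the remote source: a separate in-memory database attached
# under the schema name "remote" (an analogy, not IQ's CIS).
conn.execute("ATTACH DATABASE ':memory:' AS remote")
conn.execute("CREATE TABLE remote.sales (sale_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO remote.sales VALUES (?, ?)", [(1, 9.5), (2, 20.0)])

# Permanent local table that receives the data.
conn.execute("CREATE TABLE sales_local (sale_id INTEGER, amount REAL)")

# INSERT with a SELECT clause: the remote table is queried as if it were local.
conn.execute("INSERT INTO sales_local SELECT sale_id, amount FROM remote.sales")
conn.commit()

copied = conn.execute(
    "SELECT sale_id, amount FROM sales_local ORDER BY sale_id"
).fetchall()
print(copied)
```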
2.3.1 Advantages
• Can be used to access a variety of data sources – databases, spreadsheets and ODBC data sources
• Location transparency – gives you a uniform view of local and remote tables
2.3.2 Disadvantages
• You must know how to detect inserts, updates, and deletes at the source system
• You must create proxy tables to map to the remote tables
• CIS must go through the SQL Anywhere layer within Sybase IQ to execute SQL commands against proxy tables.
This causes it to be slower than commands that execute purely within the Sybase IQ engine. (Note that CIS has
undergone performance improvements in Sybase IQ 15.2, so that less of the processing must occur within
SQL Anywhere.)
3. INCREMENTAL LOADING METHODS
• Non-intrusive transaction capture of database changes from heterogeneous sources via the transaction logs
• Flexible transformation of data
• Efficient routing of changes across networks
• Near real-time delivery to and synchronization with replicate heterogeneous targets (databases and message buses)
With Sybase Replication, a “Replication Agent” (there is a different agent for each supported source database
platform) reads the transaction logs from primary databases and sends the data changes across the network to
“Replication Server”. Replication Server accepts the information from the agent, and applies the changes to Sybase
IQ. You can choose to replicate any changes to the entire source database (database-level replication), or you can
selectively choose tables and columns to replicate (table-level replication).
There are three ways to configure a replication environment, depending on your performance requirements, and
your source database platform:
1. Direct replication: Replication Server connects directly to Sybase IQ and applies the transactions in order to the
target IQ system.
2. Replication with Staging: Replication Server queues the source data changes for delayed bulk loading into IQ
with higher performance.
3. Replication Server Real-Time Loading Option: If your source database is Sybase ASE 15.0.3 or version 15.5 and
later, and your source database schema is the same as your Sybase IQ schema (no transformation required),
Replication Server can replicate to Sybase IQ using Sybase IQ bulk loading for high performance.
Each of these scenarios is described below.
3.1.1.1 Advantages
• It is very simple to set up
• You can use database-level rather than table-level replication to minimize complexity
• No custom code or scripts are needed
• All architecture can be designed with Sybase’s PowerDesigner modeling tool
3.1.1.2 Disadvantages
• All data is applied to IQ in OLTP, single-row format
• Performance is slow: 1 – 100 rows per second
[Figure: replication path with nodes labeled ASE, ASA, and IQ.]
Data is continuously replicated into the staging database via custom function strings so that all inserts, updates
and deletes are captured. The before and after image of each data row is maintained so that the following can be
accomplished in IQ:
• Insert/Update: an attempt is made to delete the row should it exist, and the new data is copied in
• Delete: the original data in IQ is simply deleted
At scheduled intervals, the data is moved into IQ via the bulk loader:
• (Insert/Update/Delete): the primary key of the before image will be copied using INSERT…LOCATION into a work
table in IQ, and then a delete performed by joining the work table with the permanent table on the primary key
• (Insert/Update): the after image will be copied into the IQ permanent table using INSERT…LOCATION from the
staging database
Once the data changes have been fully applied from the staging database to Sybase IQ, the data in the staging
database is cleaned out.
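The delete-then-insert batch described above can be sketched as follows. SQLite stands in for both the staging database and IQ so that the example runs anywhere, and plain inserts replace INSERT…LOCATION. The `orders` table, the `work_keys` work table, and the sample batch are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Permanent table in the target, plus a work table for primary keys.
cur.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, status TEXT)")
cur.execute("CREATE TABLE work_keys (order_id INTEGER PRIMARY KEY)")
cur.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "open"), (2, "open"), (3, "open")],
)

# One staged batch: order 2 was updated, order 3 deleted, order 4 inserted.
changed_keys = [(2,), (3,), (4,)]            # keys for insert/update/delete
after_images = [(2, "shipped"), (4, "open")]  # new rows for insert/update

# Step 1: copy the keys into the work table, then delete every matching
# row from the permanent table by joining on the primary key.
cur.executemany("INSERT INTO work_keys VALUES (?)", changed_keys)
cur.execute(
    "DELETE FROM orders WHERE order_id IN (SELECT order_id FROM work_keys)"
)

# Step 2: copy the after images into the permanent table.
cur.executemany("INSERT INTO orders VALUES (?, ?)", after_images)

# Step 3: clean out the work area once the batch is fully applied.
cur.execute("DELETE FROM work_keys")
conn.commit()

final_rows = cur.execute(
    "SELECT order_id, status FROM orders ORDER BY order_id"
).fetchall()
print(final_rows)
```

Deleting first means inserts and updates can be handled identically: whatever was there is removed, and the after image is loaded as new data.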
3.1.2.1 Advantages
• Operates with better performance than direct replication
• All architecture can be designed with PowerDesigner
• You have full control over what data (tables, columns, rows) is moved into IQ
• You can utilize Replication Server function strings to perform data cleansing and transformations
3.1.2.2 Disadvantages
• This scenario requires a staging database: ASA or ASE
• You must custom code three function strings for each table being replicated
• You must custom write data movement logic from the staging area to IQ, and create scripts to pause Replication
Server so that loading of the current batch can complete, and the staging area can be cleaned up
• Compilation: rearranges replicate data by grouping it by table and operation (insert, update and delete), and then
compiles the changes into bulk operations to be performed in the next step
• Bulk apply: applies the net result of compilation to IQ using IQ bulk loading commands. Replication Server uses
an in-memory database to store the changes to apply to IQ.
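The idea of compiling a change stream into a net result, grouped by table and operation, can be sketched in Python as below. This is a simplified model for illustration, not Replication Server's actual algorithm; the function `compile_changes`, the tuple layout of a change, and the sample data are all hypothetical.

```python
from collections import defaultdict


def compile_changes(changes):
    """Reduce an ordered change stream to one net operation per row,
    then group the survivors by (table, operation)."""
    net = {}  # (table, key) -> (operation, row image)
    for table, op, key, row in changes:
        prior = net.get((table, key))
        if op == "delete":
            if prior and prior[0] == "insert":
                del net[(table, key)]  # insert then delete cancels out
            else:
                net[(table, key)] = ("delete", None)
        elif op == "update" and prior and prior[0] == "insert":
            net[(table, key)] = ("insert", row)  # insert carries the final image
        elif op == "insert" and prior and prior[0] == "delete":
            net[(table, key)] = ("update", row)  # row existed before the batch
        else:
            net[(table, key)] = (op, row)

    groups = defaultdict(list)
    for (table, key), (op, row) in net.items():
        groups[(table, op)].append((key, row))
    return dict(groups)


changes = [
    ("trades", "insert", 1, {"qty": 10}),
    ("trades", "update", 1, {"qty": 15}),
    ("trades", "insert", 2, {"qty": 5}),
    ("trades", "delete", 2, None),
    ("quotes", "update", 7, {"bid": 101}),
    ("trades", "delete", 9, None),
]
compiled = compile_changes(changes)
print(compiled)
```

Six row-level changes collapse into three groups here, each of which could be applied as one bulk operation instead of several single-row statements.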
You can control the amount of data that is grouped together for bulk apply by adjusting the group size:
[Figure: continuous capture of changed data from ASE, held in an in-memory database, then applied to IQ as
grouped bulk loads.]
3.1.3.1 Advantages
• You have full control over what data (server-level, database-level, table-level) is moved into IQ
• Reduced number of external components (no staging database)
• Reduced latency without the overhead of either the staging solution, or the row-based performance of the direct
replication solution
• Simpler maintenance and manageability – the configuration is straightforward compared to the staging solution,
and does not require function string mapping, replication suspend and resume, and data population from staging
area to Sybase IQ.
3.1.3.2 Disadvantages
• The only source database platform that is supported is Sybase ASE
• Data transformation is not supported, and the source and target schemas must be equivalent
4. ETL – EXTRACT, TRANSFORM AND LOAD
4.1 Sybase ETL
• Sybase ETL Development: a graphical tool for designing data transformation projects. This tool also provides a
debugging environment. You design a project by placing data source, data transformation and data destination
components on a canvas and linking them together.
• Sybase ETL Server: a distributed grid engine that executes the ETL projects designed using the Sybase ETL
Development tool.
There is also an ETL Scheduler to create and execute scheduled jobs and a Runtime Manager to monitor job
execution and performance.
Sybase ETL can extract data from files, databases, and Sybase Replication (for incremental updates), transform it and
load it into Sybase IQ:
[Figure: Sybase ETL architecture. ETL Server, backed by a metadata repository, extracts from flat files, from
relational source databases (Oracle, IBM DB2, Sybase, ODBC, OLE DB), and from transaction log changes delivered by
Replication Agent and Replication Server (Sybase and Oracle databases only), and loads the results into IQ.]
4.1.1 ETL Data Sources
This table lists all the data sources that ETL can extract data from:
Loader            Description
IQ Loader File    Use this component to load or upsert (replace existing and insert new) data
                  from a file into a target IQ database using the LOAD TABLE statement. This
                  component supports client-side loading as well as server-side loading.
IQ Loader DB      Use this component to load or upsert (replace existing and insert new)
                  data from a source database into a target IQ database using the
                  INSERT…LOCATION statement. Works with Sybase databases only, but can load
                  from heterogeneous sources through Sybase ECDA.
4.1.4 Advantages
• Includes a wide variety of transformation components (supports JavaScript for user defined transformations)
• It is tuned to perform well with Sybase IQ by invoking its bulk loading methods
• It is integrated with Replication Server to replicate continuous real-time updates from a source Sybase or Oracle
database to Sybase IQ
• Supports client-side loading of flat files into Sybase IQ 15.0 and above, so flat files don’t need to be transferred via
FTP to the Sybase IQ server machine. (Note that client side loading is not supported for named pipes.)
• Sybase ETL supports multiple writers to separate tables for parallel loading into Sybase IQ. It can be configured to
utilize multiple writer nodes in a Sybase IQ Multiplex.
4.1.5 Disadvantages
• ETL does not support LOB data processing or transfer.
• Replication Server can pass replicate data to Sybase ETL from Sybase and Oracle databases only.
5. STREAM LOADING METHODS
5.1 RAP/CEP
Sybase RAP is an analytics platform for capital markets:
[Figure: Sybase RAP architecture. Inputs: market data feeds, application transactions, and internal systems.
Platform: Sybase RAP with Sybase CEP Studio and subscribers for RAPCache and RAPStore. Consumers: risk managers,
reporting, and spreadsheets/visualization.]
It consolidates market data from vendor feeds, historical time series data, real-time trades and quotes, and reference
data to an in-memory cache database (RAPCache) and a historical data store (RAPStore). RAPCache is an Adaptive
Server Enterprise (ASE) database; RAPStore is a Sybase IQ database.
Sybase CEP (Complex Event Processing) is an event processing engine that continuously executes queries against
incoming data streams. Its input and output adapters convert external data into Sybase CEP streams and vice versa.
Sybase CEP is integrated with Sybase RAP to enable the storage of market, trade, pre-analytics and other data in
real time.
Feed handlers direct inbound market data into the RAP environment. They manage connectivity and message
transformation directly from exchanges, like the NYSE, or consolidated service providers, like Reuters. RAP provides two
sample feed handlers: the FAST feed handler and the demo feed handler. All other feed handlers are currently third
party. Developers create custom feed handlers using an API and other development tools.
Consumers—automated trading applications and various user communities with analytic needs—can direct
queries requiring real-time data to RAPCache, and queries requiring historical or aggregated data to RAPStore. These
databases can be accessed from Java applications (JDBC), direct querying (ISQL), C++ applications (ODBC, CT-Library)
and .Net applications (ADO.Net).
5.1.1 Advantages
• Very fast with proven load and query benchmarks
• Flexible - the customer can pick and choose components of the architecture based on business needs
• Very quick development lifecycle to build feed handlers
• Includes a wide array of time-series, statistical and OLAP functions for sophisticated analytics
5.1.2 Disadvantages
• Must custom write inbound data feed handlers
• Does not work directly with database sources or files – expects data streams
6. ROW-BASED INSERT…VALUES
INSERT…VALUES can be used to insert individual rows. A commit should be performed at the end of the sequence of
inserts to end the transaction.
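The pattern of several single-row inserts closed by one commit can be sketched as below. SQLite stands in for Sybase IQ so that the example runs anywhere; the `ticks` table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ticks (symbol TEXT, price REAL)")

rows = [("SY", 64.5), ("IQ", 12.25), ("ASE", 30.0)]
for symbol, price in rows:
    # One row-based INSERT...VALUES statement per row.
    conn.execute(
        "INSERT INTO ticks (symbol, price) VALUES (?, ?)", (symbol, price)
    )

# A single commit at the end of the sequence ends the transaction.
conn.commit()

row_count = conn.execute("SELECT COUNT(*) FROM ticks").fetchone()[0]
print(row_count)
```

Committing once per sequence rather than once per row keeps the transaction overhead down, although each insert is still processed as an individual row.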
6.1 Advantages
• Simple, familiar SQL
6.2 Disadvantages
• INSERT…VALUES is a row-based operation in Sybase IQ, and is slow, because the IQ load process has been geared
towards large, bulk loads – not single-row or small batches.
7. SUMMARY
This document has covered the various load methods available to you with Sybase IQ. The methods vary by data
source type – files (including named pipes) and databases – and by functionality and performance. Based on your
particular needs, you should be able to find a method that works well for you.
Sybase, Inc.
Worldwide Headquarters
One Sybase Drive
Dublin, CA 94568-7902
U.S.A
1 800 8 sybase
Copyright © 2010 Sybase, Inc. All rights reserved. Unpublished rights reserved under U.S. copyright laws. Sybase, the
Sybase logo, Adaptive Server, Replication Server and SQL Anywhere are trademarks of Sybase, Inc. or its subsidiaries.
All other trademarks are the property of their respective owners. ® indicates registration in the United States.
www.sybase.com Specifications are subject to change without notice. 05/10