We will discuss data loading - in other words, how to get data into an Oracle database. The main focus of the chapter will be the SQL*LOADER tool (or SQLLDR, pronounced 'sequel loader'), as it is still the predominant method for loading data. However, we will look at a few other options along the way and will also briefly explore how to get data out of the database.
SQLLDR has been around for as long as I can remember, with various minor enhancements over time, but is still the subject of some confusion. My remit in this chapter is not to be comprehensive, but rather to tackle the issues that I see people encounter every day when using this tool. The chapter has a 'question and answer' format. I will present a frequently encountered requirement or issue, and explore possible solutions. In the process, we'll cover many of the practicalities of using the SQLLDR tool and of data loading in general:
• Loading delimited and fixed format data.
• Loading dates.
• Loading data using sequences, in which we take a look at the CASE statement, added to SQL in Oracle 8.1.6.
• How to do an upsert (update if the data exists, insert otherwise).
• Loading data with embedded newlines, where we rely on some new features and options, such as the FIX, VAR, and STR attributes, added in Oracle 8.1.6.
• Loading LOBs using the new data types, BLOB and CLOB, introduced in Oracle 8.0, which support much richer functionality than the legacy LONG and LONG RAW types.
We will not cover the direct path loader mode in any detail, nor will we cover topics such as using SQLLDR for data warehousing, parallel loads, and so on. These topics could fill an entire book by themselves.
An Introduction to SQL*LOADER
SQL*LOADER (SQLLDR) is Oracle's high-speed, bulk data loader. It is an extremely useful tool, used to get data into an Oracle database from a variety of flat file formats. SQLLDR can be used to load enormous amounts of data in an amazingly short period of time. It has two modes of operation:
• Conventional path - SQLLDR will employ SQL inserts on our behalf to load data.
• Direct path - Does not use SQL; formats database blocks directly.
The direct path load allows you to read data from a flat file and write it directly to formatted database blocks, bypassing the entire SQL engine (and rollback and redo at the same time). When used in parallel, direct path load is the fastest way from no data to a fully loaded database, and can't be beaten.
We will not cover every single aspect of SQLLDR. For all of the details, see the Oracle Server Utilities Guide, which dedicates six chapters to SQLLDR. That it gets six chapters is notable, since every other utility gets one chapter or less. For complete syntax and all of the options, I will point you to this reference guide - this chapter is intended to answer the 'How do I?' questions that a reference guide does not address.
It should be noted that, from Oracle 8.1.6 release 1 onwards, the OCI (Oracle Call Interface for C) allows you to write your own direct path loader in C as well. This is useful when the operation you want to perform is not feasible in SQLLDR, or when seamless integration with your application is desired. SQLLDR is a command line tool - a separate program. It is not an API, nor is it something that can be 'called from PL/SQL', for example.
If you execute SQLLDR from the command line with no inputs, it gives you the
following
help:
$ sqlldr
Valid Keywords:
Parameter                Meaning
----------------------   ---------------------------------------------------

                         ...allowing you to specify that only records
                         meeting specified criteria are loaded.

DISCARDMAX               Specifies the maximum number of discarded records
                         permitted in a load. If you exceed this value, the
                         load will terminate. By default, all records may be
                         discarded without terminating the load.

ERRORS                   The maximum number of errors encountered by SQLLDR
                         that are permitted before the load terminates.
                         These errors can be caused by many things, such as
                         conversion errors (trying to load ABC into a number
                         field, for example), duplicate records in a unique
                         index, and so on. By default, 50 errors are
                         permitted, after which the load will terminate. To
                         allow all valid records to be loaded in a session
                         (with the rejected records going to the BAD file),
                         specify a large number such as 999999999.

FILE                     When using the direct path load option in parallel,
                         you may use this to tell SQLLDR exactly which
                         database data file to load into. You would use this
                         to reduce contention for the database data files
                         during a parallel load, ensuring each loader
                         session writes to a separate device.

LOAD                     The maximum number of records to load. Typically
                         used to load a sample of a large data file, or in
                         conjunction with SKIP to load a specific range of
                         records from an input file.

LOG                      Used to name the LOG file. By default, SQLLDR will
                         create a LOG file named after the CONTROL file, in
                         the same fashion as the BAD file.

PARALLEL                 TRUE or FALSE. When TRUE, it signifies you are
                         doing a parallel direct path load. This is not
                         necessary when using a conventional path load;
                         conventional loads may be done in parallel without
                         setting this parameter.

PARFILE                  Specifies the name of a file that contains all of
                         these KEYWORD=VALUE pairs, used instead of
                         specifying them all on the command line.

READSIZE                 Specifies the size of the buffer used to read input
                         data.

ROWS                     The number of rows SQLLDR should insert between
                         commits in a conventional path load. In a direct
                         path load, this is the number of rows to be loaded
                         before performing a data save (similar to a
                         commit). In a conventional path load, the default
                         is 64 rows. In a direct path load, the default is
                         to not perform a data save until the load is
                         complete.

SILENT                   Suppresses various informational messages at
                         run-time.

SKIP                     Tells SQLLDR to skip x records at the start of the
                         input file. Most commonly used to resume an aborted
                         load (skipping the records that have already been
                         loaded), or to load only one part of an input file.

USERID                   The USERNAME/PASSWORD@DATABASE connect string, used
                         to authenticate to the database.

SKIP_INDEX_MAINTENANCE   Does not apply to conventional path loads - all
                         indexes are always maintained in that mode. In a
                         direct path load, this tells Oracle not to maintain
                         indexes, marking them as unusable instead. These
                         indexes must be rebuilt after the load.

SKIP_UNUSABLE_INDEXES    Tells SQLLDR to allow rows to be loaded into a
                         table that has unusable indexes, as long as those
                         indexes are not unique indexes.
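As a concrete illustration of the command line (the file names, user, and password below are hypothetical), a load that resumes after 1040 already-loaded records might be invoked as:

$ sqlldr userid=scott/tiger control=dept.ctl log=dept.log skip=1040

The same KEYWORD=VALUE pairs could instead be collected in a parameter file, say dept.par:

userid=scott/tiger
control=dept.ctl
log=dept.log
skip=1040

and the load invoked simply as sqlldr parfile=dept.par, which is convenient when the same set of options is reused across many loads.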
In order to use SQLLDR, you will need a control file. A control file simply contains information describing the input data - its layout, datatypes, and so on - as well as information about what table(s) it should be loaded into. The control file can even contain the data to load. In the following example, we build up a simple control file in a step-by-step fashion, with an explanation of the commands at each stage:
LOAD DATA
LOAD DATA - This tells SQLLDR what to do, in this case, load data. The other thing SQLLDR can do is CONTINUE_LOAD, to resume a load. We would use this option only when continuing a multi-table direct path load.
INFILE *
INFILE * - This tells SQLLDR that the data to be loaded is actually contained within the control file itself (see below). Alternatively, you could specify the name of another file that contains the data. We can override this INFILE statement with a command line parameter if we wish. Be aware that command line options override control file settings, as we shall see in the Caveats section.
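For example (file names hypothetical), even if the control file dept.ctl said INFILE *, we could direct SQLLDR to read from an external data file by supplying the DATA parameter on the command line:

$ sqlldr userid=scott/tiger control=dept.ctl data=dept.dat

Here, the command line DATA=dept.dat overrides the INFILE clause in the control file, illustrating the precedence rule just mentioned.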
Table created.

After running SQLLDR with this control file, the log file shows how each field was parsed, and the result of the load:

DEPTNO                 FIRST     *   ,       CHARACTER
DNAME                  NEXT      *   ,       CHARACTER
LOC                    NEXT      *   ,       CHARACTER

Table DEPT:
 4 Rows successfully loaded.
 0 Rows not loaded due to data errors.
 0 Rows not loaded because all WHEN clauses were failed.
 0 Rows not loaded because all fields were null.
A fixed format file would probably be the most recognized file format, but on UNIX and NT, delimited files are the norm. In this section, we will investigate the popular options used to load delimited data.
The most popular format for delimited data is the CSV format, where CSV stands for comma-separated values. In this file format, each field of data is separated from the next by a comma. Text strings can be enclosed within quotes, allowing the string itself to contain a comma. If the string must contain a quotation mark as well, the convention is to double up the quotation mark (in the following code we use "" in place of just a ").
A typical control file to load delimited data will look much like this:
LOAD DATA
INFILE *
INTO TABLE DEPT
REPLACE
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
(DEPTNO,
DNAME,
LOC
)
BEGINDATA
10,Sales,"""USA"""
20,Accounting,"Virginia,USA"
30,Consulting,Virginia
40,Finance,Virginia
50,"Finance","",Virginia
60,"Finance",,Virginia
The following line performs the bulk of the work:
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
It specifies that a comma separates the data fields, and that each field might be
enclosed in
double quotes. When we run SQLLDR using this control file, the results will be:
tkyte@TKYTE816> select * from dept;

    DEPTNO DNAME          LOC
---------- -------------- --------------
        10 Sales          "USA"
        20 Accounting     Virginia,USA
        30 Consulting     Virginia
        40 Finance        Virginia
        50 Finance
        60 Finance

6 rows selected.
Notice the following in particular:

• "USA" - This resulted from input data that was """USA""". SQLLDR counted the double occurrence of " as a single occurrence within the enclosed string. In order to load a string that contains the optional enclosure character, you must ensure it is doubled up.
• Virginia,USA in department 20 - This resulted from input data that was "Virginia,USA". This input field had to be enclosed in quotes to retain the comma as part of the data. Otherwise, the comma would have been treated as the end-of-field marker, and Virginia would have been loaded without the USA text.
• Departments 50 and 60 were loaded with null location fields. When data is missing, you can choose to enclose it or not; the effect is the same.
Another popular format is tab-delimited data: data that is separated by tabs rather than commas. There are two ways to load this data using the TERMINATED BY clause:

• TERMINATED BY X'09', which is the tab character specified in hexadecimal format (ASCII 9 is a tab character), or
• TERMINATED BY WHITESPACE

The two are very different in implementation however, as the following shows. Using the DEPT table from above, we'll load using this control file:
LOAD DATA
INFILE *
INTO TABLE DEPT
REPLACE
FIELDS TERMINATED BY WHITESPACE
(DEPTNO,
DNAME,
LOC)
BEGINDATA
10 Sales Virginia
It is not readily visible on the page, but there are two tabs between each piece of
data in the
above. The data line is really:
10\t\tSales\t\tVirginia
Where the \t is the universally recognized tab escape sequence. When you use this
control
file with the TERMINATED BY WHITESPACE clause as above, the resulting data in the
table DEPT is:
tkyte@TKYTE816> select * from dept;

    DEPTNO DNAME          LOC
---------- -------------- --------------
        10 Sales          Virginia

This is because TERMINATED BY WHITESPACE looks for the first occurrence of whitespace and then skips all contiguous whitespace that follows it; the doubled-up tabs therefore do not produce empty fields.
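Had we used TERMINATED BY X'09' instead, every individual tab would act as a field terminator. A sketch of that variant (same DEPT table and the same doubled-tab input assumed; the separators in the data line are literal tab characters):

LOAD DATA
INFILE *
INTO TABLE DEPT
REPLACE
FIELDS TERMINATED BY X'09'
(DEPTNO,
DNAME,
LOC)
BEGINDATA
10		Sales		Virginia

With two tabs between each value, the line parses as 10, an empty field, then Sales, so DEPTNO would be 10, DNAME would be null, and LOC would be Sales - most likely not the desired outcome.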
Sometimes the input data is not delimited at all; instead, each field occupies a fixed range of byte positions in the record, and a fixed position format is generally the best approach for such data. The downside to a fixed width file is, of course, that it can be much larger than a simple, delimited file format.
To load fixed position data, you will use the POSITION keyword in the control file.
For
example:
LOAD DATA
INFILE *
INTO TABLE DEPT
REPLACE
( DEPTNO position(1:2),
DNAME position(3:16),
LOC position(17:29)
)
BEGINDATA
10Accounting Virginia,USA
This control file does not employ the FIELDS TERMINATED BY clause; rather, it uses POSITION to tell SQLLDR where fields begin and end. Of interest with the POSITION clause is that we can use overlapping positions, and go back and forth in the record. For example, if we were to alter the DEPT table as follows:
tkyte@TKYTE816> alter table dept add entire_line varchar(29);
Table altered.
And then use the following control file:
LOAD DATA
INFILE *
INTO TABLE DEPT
REPLACE
( DEPTNO position(1:2),
DNAME position(3:16),
LOC position(17:29),
ENTIRE_LINE position(1:29)
)
BEGINDATA
10Accounting Virginia,USA
The field ENTIRE_LINE is defined as position(1:29) - it extracts its data from all 29 bytes of the input data, whereas the other fields are substrings of that same input. With the above control file, ENTIRE_LINE will be loaded with the complete input record, 10Accounting Virginia,USA, while DEPTNO, DNAME, and LOC receive their respective substrings of it.
To demonstrate loading dates, suppose we first add a date column to DEPT:

tkyte@TKYTE816> alter table dept add last_updated date;

Table altered.

We can load it with the following control file:
LOAD DATA
INFILE *
INTO TABLE DEPT
REPLACE
FIELDS TERMINATED BY ','
(DEPTNO,
DNAME,
LOC,
LAST_UPDATED date 'dd/mm/yyyy'
)
BEGINDATA
10,Sales,Virginia,1/5/2000
20,Accounting,Virginia,21/6/1999
30,Consulting,Virginia,5/1/2000
40,Finance,Virginia,15/3/2001
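If the input data also carried a time component, the same technique would apply with a longer date mask. For instance (a hypothetical extension of the example above, not part of it), the field definition

LAST_UPDATED date 'dd/mm/yyyy hh24:mi:ss'

would load an input value such as 1/5/2000 12:03:03 into both the date and time portions of the column.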