
DATA LOADING

We will discuss data loading - in other words, how to get data into an Oracle database. The main focus of the chapter will be the SQL*LOADER tool (or SQLLDR, pronounced 'sequel loader'), as it is still the predominant method for loading data. However, we will look at a few other options along the way and will also briefly explore how to get data out of the database.

SQLLDR has been around for as long as I can remember, with various minor enhancements over time, but is still the subject of some confusion. My remit in this chapter is not to be comprehensive, but rather to tackle the issues that I see people encounter every day when using this tool. The chapter has a 'question and answer' format: I will present a frequently encountered requirement or issue, and explore possible solutions. In the process we'll cover many of the practicalities of using the SQLLDR tool and of data loading in general:

• Loading delimited and fixed format data.
• Loading dates.
• Loading data using sequences, in which we take a look at the CASE statement in SQL, added in Oracle 8.1.6.
• How to do an upsert (update if data exists, insert otherwise).
• Loading data with embedded newlines, where we rely on some new features and options, such as the FIX, VAR, and STR attributes, added in Oracle 8.1.6.
• Loading LOBs using the new data types, BLOB and CLOB, introduced in Oracle 8.0, which support much deeper functionality than the legacy LONG and LONG RAW types.

We will not cover the direct path loader mode in any detail, nor will we cover topics such as using SQLLDR for data warehousing, parallel loads, and so on. These topics could fill an entire book by themselves.
An Introduction to SQL*LOADER
SQL*LOADER (SQLLDR) is Oracle's high-speed, bulk data loader. It is an extremely useful tool, used to get data into an Oracle database from a variety of flat file formats. SQLLDR can be used to load enormous amounts of data in an amazingly short period of time. It has two modes of operation:

• Conventional path - SQLLDR will employ SQL inserts on our behalf to load data.
• Direct path - Does not use SQL; it formats database blocks directly.

The direct path load allows you to read data from a flat file, and write it directly to formatted database blocks, bypassing the entire SQL engine (and rollback and redo at the same time). When used in parallel, direct path load is the fastest way from no data to a fully loaded database, and can't be beaten.
We will not cover every single aspect of SQLLDR. For all of the details, the Oracle Server Utilities Guide dedicates six chapters to SQLLDR. The fact that it gets six chapters is notable, since every other utility gets one chapter or less. For complete syntax and all of the options, I will point you to this reference guide - this chapter is intended to answer the 'How do I?' questions that a reference guide does not address.

It should be noted that the OCI (Oracle Call Interface for C) allows you to write your own direct path loader using C, with Oracle 8.1.6 release 1 and onwards. This is useful when the operation you want to perform is not feasible in SQLLDR, or when seamless integration with your application is desired. SQLLDR is a command line tool - a separate program. It is not an API or anything that can be 'called from PL/SQL', for example.

If you execute SQLLDR from the command line with no inputs, it gives you the following help:
$ sqlldr

SQLLDR: Release 8.1.6.1.0 - Production on Sun Sep 17 12:02:59 2000

(c) Copyright 1999 Oracle Corporation. All rights reserved.

Usage: SQLLOAD keyword=value [,keyword=value,...]

Valid Keywords:

userid -- ORACLE username/password
control -- Control file name
log -- Log file name
bad -- Bad file name
data -- Data file name
discard -- Discard file name
discardmax -- Number of discards to allow (Default all)
skip -- Number of logical records to skip (Default 0)
load -- Number of logical records to load (Default all)
errors -- Number of errors to allow (Default 50)
rows -- Number of rows in conventional path bind array or between
        direct path data saves (Default: Conventional path 64, Direct path all)
bindsize -- Size of conventional path bind array in bytes (Default 65536)
silent -- Suppress messages during run (header, feedback, errors,
          discards, partitions)
direct -- use direct path (Default FALSE)
parfile -- parameter file: name of file that contains parameter specifications
parallel -- do parallel load (Default FALSE)
file -- File to allocate extents from
skip_unusable_indexes -- disallow/allow unusable indexes or index partitions
                         (Default FALSE)
skip_index_maintenance -- do not maintain indexes, mark affected indexes as
                          unusable (Default FALSE)
commit_discontinued -- commit loaded rows when load is discontinued
                       (Default FALSE)
readsize -- Size of Read buffer (Default 1048576)
We will quickly go over the meaning of these parameters:

BAD - The name of a file that will contain rejected records at the end of the load. If you do not specify a name for this, the BAD file will be named after the CONTROL file (see later in the chapter for more details on control files) we use to load with. For example, if you use a CONTROL file named foo.ctl, the BAD file will default to foo.bad, which SQLLDR will write to (or overwrite if it already exists).

BINDSIZE - The size in bytes of the buffer used by SQLLDR to insert data in the conventional path loader. It is not used in a direct path load. It is used to size the array with which SQLLDR will insert data.

CONTROL - The name of a CONTROL file, which describes to SQLLDR how the input data is formatted, and how to load it into a table. You need a CONTROL file for every SQLLDR execution.

DATA - The name of the input file from which to read the data.

DIRECT - Valid values are TRUE and FALSE, with the default being FALSE. By default, SQLLDR will use the conventional path load method.

DISCARD - The name of a file to write records that are not to be loaded. SQLLDR can be used to filter input records, allowing you to specify that only records meeting a specified criterion are loaded.

DISCARDMAX - Specifies the maximum number of discarded records permitted in a load. If you exceed this value, the load will terminate. By default, all records may be discarded without terminating a load.

ERRORS - The maximum number of errors encountered by SQLLDR that are permitted before the load terminates. These errors can be caused by many things, such as conversion errors (trying to load ABC into a number field, for example), duplicate records in a unique index, and so on. By default, 50 errors are permitted, and then the load will terminate. In order to allow all valid records to be loaded in that particular session (with the rejected records going to the BAD file), specify a large number such as 999999999.

FILE - When using the direct path load option in parallel, you may use this to tell SQLLDR exactly which database data file to load into. You would use this to reduce contention for the database data files during a parallel load, to ensure each loader session is writing to a separate device.

LOAD - The maximum number of records to load. Typically used to load a sample of a large data file, or used in conjunction with SKIP to load a specific range of records from an input file.

LOG - Used to name the LOG file. By default, SQLLDR will create a LOG file named after the CONTROL file, in the same fashion as the BAD file.

PARALLEL - Will be TRUE or FALSE. When TRUE, it signifies you are doing a parallel direct path load. This is not necessary when using a conventional path load - they may be done in parallel without setting this parameter.

PARFILE - Can be used to specify the name of a file that contains all of these KEYWORD=VALUE pairs. This is used instead of specifying them all on the command line.

READSIZE - Specifies the size of the buffer used to read input data.

ROWS - The number of rows SQLLDR should insert between commits in a conventional path load. In a direct path load, this is the number of rows to be loaded before performing a data save (similar to a commit). In a conventional path load, the default is 64 rows. In a direct path load, the default is to not perform a data save until the load is complete.

SILENT - Suppresses various informational messages at run-time.

SKIP - Used to tell SQLLDR to skip x number of records in the input file. Most commonly used to resume an aborted load (skipping the records that have been already loaded), or to load only one part of an input file.

USERID - The USERNAME/PASSWORD@DATABASE connect string. Used to authenticate to the database.

SKIP_INDEX_MAINTENANCE - Does not apply to conventional path loads - all indexes are always maintained in this mode. In a direct path load, this tells Oracle not to maintain indexes, by marking them as unusable. These indexes must be rebuilt after the load.

SKIP_UNUSABLE_INDEXES - Tells SQLLDR to allow rows to be loaded into a table that has unusable indexes, as long as the indexes are not unique indexes.
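As an illustration of the PARFILE option, many of these settings can be collected in a plain text parameter file instead of being typed on the command line. The following is a sketch only; the file name demo.par and the specific values are examples, not from the original text:

```
userid=tkyte/tkyte
control=demo1.ctl
log=demo1.log
errors=999999999
```

With this file in place, the load would be invoked as sqlldr parfile=demo.par, with no other command line arguments required.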
In order to use SQLLDR, you will need a control file. A control file simply contains information describing the input data - its layout, datatypes, and so on - as well as information about what table(s) it should be loaded into. The control file can even contain the data to load. In the following example, we build up a simple control file in a step-by-step fashion, with an explanation of the commands at each stage:
LOAD DATA

LOAD DATA - This tells SQLLDR what to do, in this case load data. The other thing SQLLDR can do is CONTINUE_LOAD, to resume a load. We would use this option only when continuing a multi-table direct path load.

INFILE *

INFILE * - This tells SQLLDR the data to be loaded is actually contained within the control file itself (see below). Alternatively, you could specify the name of another file that contains the data. We can override this INFILE statement using a command line parameter if we wish. Be aware that command line options override control file settings, as we shall see in the Caveats section.

INTO TABLE DEPT

INTO TABLE DEPT - This tells SQLLDR to which table we are loading data, in this case the DEPT table.

FIELDS TERMINATED BY ','

FIELDS TERMINATED BY ',' - This tells SQLLDR that the data will be in the form of comma-separated values. There are dozens of ways to describe the input data to SQLLDR; this is just one of the more common methods.
(DEPTNO,
DNAME,
LOC
)

(DEPTNO, DNAME, LOC) - This tells SQLLDR what columns we are loading, their order in the input data, and their data types. The data types are those of the data in the input stream, not the data types in the database. In this case, they are defaulting to CHAR(255), which is sufficient.

BEGINDATA

BEGINDATA - This tells SQLLDR we have finished describing the input data, and that the very next line will be the actual data to be loaded into the DEPT table:

10,Sales,Virginia
20,Accounting,Virginia
30,Consulting,Virginia
40,Finance,Virginia
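Assembled in one piece, the control file we have just built up looks like this:

```
LOAD DATA
INFILE *
INTO TABLE DEPT
FIELDS TERMINATED BY ','
(DEPTNO,
DNAME,
LOC
)
BEGINDATA
10,Sales,Virginia
20,Accounting,Virginia
30,Consulting,Virginia
40,Finance,Virginia
```

Saved as demo1.ctl, this is the file we will use in the next step.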
So, this is a control file in one of its most simple and common formats - one to load delimited data into a table. We will take a look at some much more complex examples in this chapter, but this is a good one to get our feet wet with. To use this control file, all we need to do is create an empty DEPT table:
tkyte@TKYTE816> create table dept
2 ( deptno number(2) constraint emp_pk primary key,
3 dname varchar2(14),
4 loc varchar2(13)
5 )
6 /

Table created.

and run the following command:

C:\sqlldr>sqlldr userid=tkyte/tkyte control=demo1.ctl

SQLLDR: Release 8.1.6.0.0 - Production on Sat Apr 14 10:54:56 2001

(c) Copyright 1999 Oracle Corporation. All rights reserved.

Commit point reached - logical record count 4

If the table is not empty, you will receive an error message to the following effect:

SQLLDR-601: For INSERT option, table must be empty. Error on table DEPT

This is because we allowed almost everything in the control file to default, and the default load option is INSERT (as opposed to APPEND, TRUNCATE, or REPLACE). In order to INSERT, SQLLDR assumes the table is empty. If we wanted to add records to the DEPT table, we could have specified APPEND; to replace the data in the DEPT table, we could have used REPLACE or TRUNCATE.
Every load will generate a log file. The log file from our simple load looks like this:

SQLLDR: Release 8.1.6.0.0 - Production on Sat Apr 14 10:58:02 2001

(c) Copyright 1999 Oracle Corporation. All rights reserved.

Control File:   demo1.ctl
Data File:      demo1.ctl
Bad File:       demo1.bad
Discard File:   none specified

(Allow all discards)

Number to load: ALL
Number to skip: 0
Errors allowed: 50
Bind array:     64 rows, maximum of 65536 bytes
Continuation:   none specified
Path used:      Conventional

Table DEPT, loaded from every logical record.
Insert option in effect for this table: INSERT

Column Name               Position   Len   Term Encl Datatype
------------------------- ---------- ----- ---- ---- ---------------------
DEPTNO                    FIRST      *     ,         CHARACTER
DNAME                     NEXT       *     ,         CHARACTER
LOC                       NEXT       *     ,         CHARACTER

Table DEPT:
  4 Rows successfully loaded.
  0 Rows not loaded due to data errors.
  0 Rows not loaded because all WHEN clauses were failed.
  0 Rows not loaded because all fields were null.

Space allocated for bind array:    49536 bytes(64 rows)
Space allocated for memory besides bind array:    0 bytes

Total logical records skipped:    0
Total logical records read:       4
Total logical records rejected:   0
Total logical records discarded:  0

Run began on Sat Apr 14 10:58:02 2001
Run ended on Sat Apr 14 10:58:02 2001

Elapsed time was:  00:00:00.11
CPU time was:      00:00:00.04
These log files tell us about many aspects of our load. We can see the options we used (defaulted or otherwise). We can see how many records were read, how many were loaded, and so on. The log file specifies the locations of all BAD and DISCARD files. It even tells us how long the load took. These log files are crucial for verifying that the load was successful, as well as for diagnosing errors. If the loaded data resulted in SQL errors (the input data was 'bad', and created records in the BAD file), these errors would be recorded here. The information in the log file is largely self-explanatory, so we will not spend any more time on it.

How to ...

We will now cover what I have found to be the most frequently asked questions with regard to loading and unloading data in an Oracle database, using SQLLDR.
Load Delimited Data

Delimited data - data that is separated by some special character, and perhaps enclosed in quotes - is the most popular data format for flat files today. On a mainframe, a fixed length, fixed format file would probably be the most recognized file format, but on UNIX and NT, delimited files are the norm. In this section, we will investigate the popular options used to load delimited data.

The most popular format for delimited data is the CSV format, where CSV stands for comma-separated values. In this file format, each field of data is separated from the next by a comma. Text strings can be enclosed within quotes, thus allowing the string itself to contain a comma. If the string must contain a quotation mark as well, the convention is to double up the quotation mark (in the following code we use "" in place of just a ").
A typical control file to load delimited data will look much like this:

LOAD DATA
INFILE *
INTO TABLE DEPT
REPLACE
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
(DEPTNO,
DNAME,
LOC
)
BEGINDATA
10,Sales,"""USA"""
20,Accounting,"Virginia,USA"
30,Consulting,Virginia
40,Finance,Virginia
50,"Finance","",Virginia
60,"Finance",,Virginia

The following line performs the bulk of the work:

FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'

It specifies that a comma separates the data fields, and that each field might be enclosed in double quotes. When we run SQLLDR using this control file, the results will be:
tkyte@TKYTE816> select * from dept;

    DEPTNO DNAME          LOC
---------- -------------- -------------
        10 Sales          "USA"
        20 Accounting     Virginia,USA
        30 Consulting     Virginia
        40 Finance        Virginia
        50 Finance
        60 Finance

6 rows selected.
Notice the following in particular:

• "USA" - This resulted from input data that was """USA""". SQLLDR counted the double occurrence of " as a single occurrence within the enclosed string. In order to load a string that contains the optional enclosure character, you must ensure the character is doubled up.
• Virginia,USA in department 20 - This resulted from input data that was "Virginia,USA". This input data field had to be enclosed in quotes to retain the comma as part of the data. Otherwise, the comma would have been treated as the end-of-field marker, and Virginia would have been loaded without the USA text.
• Departments 50 and 60 were loaded with Null location fields. When data is missing, you can choose to enclose it or not; the effect is the same.
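The quote-doubling convention described above is not unique to SQLLDR. As a quick sanity check outside the database, Python's standard csv module parses the department 10 record the same way (this is an illustration only; SQLLDR itself is not involved):

```python
import csv
import io

# The department 10 record from the example above: the field """USA"""
# is an enclosed string containing two doubled quotation marks.
line = '10,Sales,"""USA"""\n'

# csv.reader applies the same rule SQLLDR does: inside an enclosed
# field, a doubled quote stands for a single literal quote character.
row = next(csv.reader(io.StringIO(line)))
print(row)  # ['10', 'Sales', '"USA"']
```

This can be handy when preparing or verifying CSV input files before handing them to the loader.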
Another popular format is tab-delimited data: data that is separated by tabs rather than commas. There are two ways to load this data using the TERMINATED BY clause:

• TERMINATED BY X'09', which is the tab character using hexadecimal format (ASCII 9 is a tab character), or you might use
• TERMINATED BY WHITESPACE

The two are very different in implementation, however, as the following shows. Using the DEPT table from above, we'll load using this control file:
LOAD DATA
INFILE *
INTO TABLE DEPT
REPLACE
FIELDS TERMINATED BY WHITESPACE
(DEPTNO,
DNAME,
LOC)
BEGINDATA
10 Sales Virginia
It is not readily visible on the page, but there are two tabs between each piece of data in the above. The data line is really:

10\t\tSales\t\tVirginia

where \t is the universally recognized tab escape sequence. When you use this control file with the TERMINATED BY WHITESPACE clause as above, the resulting data in the table DEPT is:
tkyte@TKYTE816> select * from dept;

    DEPTNO DNAME          LOC
---------- -------------- -------------
        10 Sales          Virginia

TERMINATED BY WHITESPACE parses the string by looking for the first occurrence of whitespace (tab, blank, or newline) and then continues until it finds the next non-whitespace character. Hence, when it parsed the data, DEPTNO had 10 assigned to it, the two subsequent tabs were considered as whitespace, and then Sales was assigned to DNAME, and so on.
On the other hand, if you were to use FIELDS TERMINATED BY X'09', as the following control file does:

LOAD DATA
INFILE *
INTO TABLE DEPT
REPLACE
FIELDS TERMINATED BY X'09'
(DEPTNO,
DNAME,
LOC
)
BEGINDATA
10 Sales Virginia
You would find DEPT loaded with the following data:

tkyte@TKYTE816> select * from dept;

    DEPTNO DNAME          LOC
---------- -------------- -------------
        10                Sales

Here, once SQLLDR encountered a tab, it output a value. Hence, 10 is assigned to DEPTNO, and DNAME gets Null, since there is no data between the first tab and the next occurrence of a tab. Sales gets assigned to LOC.

This is the intended behavior of TERMINATED BY WHITESPACE, and TERMINATED


BY <character>. Which is more appropriate to use will be dictated by the input
data, and
how you need it to be interpreted.
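The difference between the two clauses can be sketched in a few lines of Python, using str.split as a stand-in for SQLLDR's parsing (an illustration only, not SQLLDR itself):

```python
# The tab-delimited record from the example: two tabs between values.
line = "10\t\tSales\t\tVirginia"

# Like TERMINATED BY WHITESPACE: any run of whitespace is one
# separator, so no empty fields are produced.
whitespace_fields = line.split()
print(whitespace_fields)  # ['10', 'Sales', 'Virginia']

# Like TERMINATED BY X'09': every tab ends a field, so consecutive
# tabs produce empty (Null) fields.
tab_fields = line.split("\t")
print(tab_fields)  # ['10', '', 'Sales', '', 'Virginia']
```

The second result mirrors the DEPT output above: the first field is 10, the second is empty (Null), and the third is Sales.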
Lastly, when loading delimited data such as this, it is very common to want to skip over various columns in the input record. For example, you might want to load columns 1, 3, and 5, skipping columns 2 and 4. In order to do this, SQLLDR provides the FILLER keyword. This allows us to map a column in an input record, but not put it into the database. For example, given the DEPT table from above, the following control file contains four delimited fields but will not load the second field into the database:

LOAD DATA
INFILE *
INTO TABLE DEPT
REPLACE
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
( DEPTNO,
FILLER_1 FILLER,
DNAME,
LOC
)
BEGINDATA
20,Something Not To Be Loaded,Accounting,"Virginia,USA"
The resulting DEPT table is:

tkyte@TKYTE816> select * from dept;

    DEPTNO DNAME          LOC
---------- -------------- -------------
        20 Accounting     Virginia,USA
Load Fixed Format Data

Often, you have a flat file generated from some external system, and this file is a fixed length file with positional data. For example, the NAME field is in bytes 1 to 10, the ADDRESS field is in bytes 11 to 35, and so on. We will look at how SQLLDR can import this kind of data for us.

This fixed width, positional data is the optimal data for SQLLDR to load. It will be the fastest to process, as the input data stream is somewhat trivial to parse. SQLLDR will have stored fixed byte offsets and lengths into data records, and extracting a given field is very simple. If you have an extremely large volume of data to load, converting it to a fixed position format is generally the best approach. The downside to a fixed width file is, of course, that it can be much larger than a simple, delimited file format.

To load fixed position data, you will use the POSITION keyword in the control file. For example:
LOAD DATA
INFILE *
INTO TABLE DEPT
REPLACE
( DEPTNO position(1:2),
DNAME position(3:16),
LOC position(17:29)
)
BEGINDATA
10Accounting Virginia,USA
This control file does not employ the FIELDS TERMINATED BY clause; rather, it uses POSITION to tell SQLLDR where fields begin and end. An interesting point about the POSITION clause is that we can use overlapping positions, and go back and forth in the record. For example, if we were to alter the DEPT table as follows:

tkyte@TKYTE816> alter table dept add entire_line varchar(29);

Table altered.
And then use the following control file:

LOAD DATA
INFILE *
INTO TABLE DEPT
REPLACE
( DEPTNO position(1:2),
DNAME position(3:16),
LOC position(17:29),
ENTIRE_LINE position(1:29)
)
BEGINDATA
10Accounting    Virginia,USA

The field ENTIRE_LINE is defined as position(1:29) - it extracts its data from all 29 bytes of input data, whereas the other fields are substrings of the input data. The outcome of the above control file will be:

tkyte@TKYTE816> select * from dept;

    DEPTNO DNAME          LOC           ENTIRE_LINE
---------- -------------- ------------- -----------------------------
        10 Accounting     Virginia,USA  10Accounting    Virginia,USA
When using POSITION, we can use relative or absolute offsets. In the above, I used absolute offsets: I specifically denoted where fields begin, and where they end. I could have written the above control file as:

LOAD DATA
INFILE *
INTO TABLE DEPT
REPLACE
( DEPTNO position(1:2),
DNAME position(*:16),
LOC position(*:29),
ENTIRE_LINE position(1:29)
)
BEGINDATA
10Accounting    Virginia,USA

The * instructs the control file to pick up where the last field left off. Therefore, (*:16) is just the same as (3:16) in this case. Notice that you can mix relative and absolute positions in the control file. Additionally, when using the * notation, you can add to the offset. For example, if DNAME started 2 bytes after the end of DEPTNO, I could have used (*+2:16). In this example, the effect would be identical to using (5:16).

The ending position in the POSITION clause must be the absolute column position where the data ends. At times, it can be easier to specify just the length of each field, especially if the fields are contiguous, as in the above example. In this fashion, we would just have to tell SQLLDR the record starts at byte 1, and then specify the length of each field. This will save us from having to compute start and stop byte offsets into the record, which can be hard at times. In order to do this, we'll leave off the ending position and, instead, specify the length of each field in the fixed length record as follows:
LOAD DATA
INFILE *
INTO TABLE DEPT
REPLACE
( DEPTNO position(1) char(2),
DNAME position(*) char(14),
LOC position(*) char(13),
ENTIRE_LINE position(1) char(29)
)
BEGINDATA
10Accounting    Virginia,USA
Here we only had to tell SQLLDR where the first field begins, and its length. Each subsequent field starts where the last one left off, and continues for a specified length. It is not until the last field that we have to specify a position again, since this field goes back to the beginning of the record.
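For illustration, the POSITION(start:end) arithmetic maps directly onto substring extraction. The following Python sketch mimics what SQLLDR does with the record above (positions are 1-based in the control file, 0-based in Python; the trailing padding in the sample line is an assumption to make it a full 29 bytes):

```python
# The fixed format record: DEPTNO in bytes 1-2, DNAME in 3-16 (14 bytes),
# LOC in 17-29 (13 bytes), padded with blanks to 29 bytes total.
line = "10Accounting    Virginia,USA "

deptno = line[0:2]        # position(1:2)
dname = line[2:16]        # position(3:16)
loc = line[16:29]         # position(17:29)
entire_line = line[0:29]  # position(1:29) - overlaps the other fields

print(deptno, dname.strip(), loc.strip())
```

The overlapping ENTIRE_LINE slice shows why SQLLDR has no trouble reading the same bytes twice: each field is just an independent substring of the record.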
Load Dates

Loading dates using SQLLDR is fairly straightforward, but it seems to be a common point of confusion. You simply need to use the DATE data type in the control file, and specify the date mask to be used. This date mask is the same mask you use with TO_CHAR and TO_DATE in the database. SQLLDR will apply this date mask to your data and load it for you.

For example, if we alter our DEPT table again:

tkyte@TKYTE816> alter table dept add last_updated date;

Table altered.

We can load it with the following control file:

LOAD DATA
INFILE *
INTO TABLE DEPT
REPLACE
FIELDS TERMINATED BY ','
(DEPTNO,
DNAME,
LOC,
LAST_UPDATED date 'dd/mm/yyyy'
)
BEGINDATA
10,Sales,Virginia,1/5/2000
20,Accounting,Virginia,21/6/1999
30,Consulting,Virginia,5/1/2000
40,Finance,Virginia,15/3/2001
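For comparison, the Oracle mask 'dd/mm/yyyy' corresponds to the strptime format '%d/%m/%Y'. The following Python sketch (illustrative only, outside the database) converts the same input values that SQLLDR parses above:

```python
from datetime import datetime

# The LAST_UPDATED values from the data above, in dd/mm/yyyy form.
values = ["1/5/2000", "21/6/1999", "5/1/2000", "15/3/2001"]

# strptime, like the SQLLDR date mask here, accepts single-digit
# day and month values for these inputs.
dates = [datetime.strptime(v, "%d/%m/%Y").date() for v in values]
print(dates[0])  # 2000-05-01
```

This makes it easy to check, before a load, that a date mask really matches the shape of the incoming data.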
