
Partitioning File Sources

When a session uses a file source, you can configure it to read the source with one thread or with multiple
threads. The Integration Service creates one connection to the file source when you configure the session to
read with one thread, and it creates multiple concurrent connections to the file source when you configure
the session to read with multiple threads. Use the following types of partitioned file sources:

Flat file. You can configure a session to read flat file, XML, or COBOL source files.

Command. You can configure a session to use an operating system command to generate source
data rows or generate a file list.

When connecting to file sources, you must choose the same connection type for all partitions. You may
choose different connection objects as long as each object is of the same type.

To specify single- or multi-threaded reading for flat file sources, configure the source file name property for
partitions 2-n. To configure for single-threaded reading, pass empty data through partitions 2-n. To
configure for multi-threaded reading, leave the source file name blank for partitions 2-n.

Rules and Guidelines for Partitioning File Sources


Use the following rules and guidelines when you configure a file source session with multiple partitions:

Use pass-through partitioning at the source qualifier.

Use single- or multi-threaded reading with flat file or COBOL sources.

Use single-threaded reading with XML sources.

You cannot use multi-threaded reading if the source files are non-disk files, such as FTP files or
WebSphere MQ sources.

If you use a shift-sensitive code page, use multi-threaded reading if the following conditions are
true:

The file is fixed-width.

The file is not line sequential.

You have not enabled user-defined shift state in the source definition.

For example, to read data from three flat files concurrently, you must specify three partitions at the
source qualifier. Accept the default partition type, pass-through.

If you configure a session for multi-threaded reading, and the Integration Service cannot create
multiple threads to a file source, it writes a message to the session log and reads the source with
one thread.

When the Integration Service uses multiple threads to read a source file, it may not read the rows
in the file sequentially. If a sort order is important, configure the session to read the file with a
single thread. For example, sort order may be important if the mapping contains a sorted Joiner
transformation and the file source is the sort origin.

You can also use a combination of direct and indirect files to balance the load.

Session performance for multi-threaded reading is optimal with large source files. The load may be
unbalanced if the amount of input data is small.

You cannot use a command for a file source if the command generates source data and the session
is configured to run on a grid or is configured with the resume from the last checkpoint recovery
strategy.
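
The rules above can be read as a checklist. The following is a minimal sketch of such a checklist, not part of the Integration Service or the Workflow Manager. The class FileSourceSession, its field names, and the function rule_violations are hypothetical names chosen for illustration, and only three of the rules are covered.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FileSourceSession:
    """Hypothetical summary of the settings the rules above refer to."""
    source_type: str                      # "flat", "cobol", or "xml"
    multi_threaded: bool = False
    on_disk: bool = True                  # False for FTP files or WebSphere MQ sources
    command_generates_data: bool = False
    runs_on_grid: bool = False
    resumes_from_checkpoint: bool = False


def rule_violations(session: FileSourceSession) -> List[str]:
    """Return the guidelines that the given settings break (empty if none)."""
    problems = []
    if session.multi_threaded and session.source_type == "xml":
        problems.append("XML sources require single-threaded reading")
    if session.multi_threaded and not session.on_disk:
        problems.append("non-disk sources cannot use multi-threaded reading")
    if session.command_generates_data and (
        session.runs_on_grid or session.resumes_from_checkpoint
    ):
        problems.append(
            "a data-generating command cannot run on a grid or with "
            "resume from the last checkpoint recovery"
        )
    return problems
```

For example, rule_violations(FileSourceSession("xml", multi_threaded=True)) returns a one-item list that names the XML rule.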

Using One Thread to Read a File Source


When the Integration Service uses one thread to read a file source, it creates one connection to the source.
The Integration Service reads the rows in the file or file list sequentially. You can configure single-threaded
reading for direct or indirect file sources in a session:

Reading direct files. You can configure the Integration Service to read from one or more direct
files. If you configure the session with more than one direct file, the Integration Service creates a
concurrent connection to each file. It does not create multiple connections to a file.

Reading indirect files. When the Integration Service reads an indirect file, it reads the file list and
then reads the files in the list sequentially. If the session has more than one file list, the Integration
Service reads the file lists concurrently, and it reads the files in the list sequentially.
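
The sequential behavior described for an indirect file can be pictured with a short sketch. It is an illustration only, not the Integration Service reader. The function name read_file_list is hypothetical, and resolving relative names against the directory of the file list is an assumption made for the example.

```python
from pathlib import Path
from typing import Iterator


def read_file_list(list_path: Path) -> Iterator[str]:
    """Yield rows from every file named in a file list, in list order."""
    base = list_path.parent
    for entry in list_path.read_text().splitlines():
        entry = entry.strip()
        if not entry:
            continue  # skip blank lines in the file list
        path = Path(entry)
        if not path.is_absolute():
            path = base / path  # assumption: relative names sit next to the list
        # One file is read to the end before the next one is opened.
        with path.open() as source:
            for row in source:
                yield row.rstrip("\n")
```

Because a single reader walks the list from top to bottom, rows come out in file-list order and, within each file, in file order.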

Using Multiple Threads to Read a File Source


When the Integration Service uses multiple threads to read a source file, it creates multiple concurrent
connections to the source. The Integration Service may or may not read the rows in a file sequentially. You
can configure multi-threaded reading for direct or indirect file sources in a session:

Reading direct files. When the Integration Service reads a direct file, it creates multiple reader
threads to read the file concurrently. You can configure the Integration Service to read from one or
more direct files. For example, if a session reads from two files and you create five partitions, the
Integration Service may distribute one file between two partitions and one file between three
partitions.

Reading indirect files. When the Integration Service reads an indirect file, it creates multiple
threads to read the file list concurrently. It also creates multiple threads to read the files in the list
concurrently. The Integration Service may use more than one thread to read a single file.
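
One common way to let several threads share a single file is to give each thread a byte range and assign every row to the range that contains its first byte. The sketch below shows that general technique; it is not the Integration Service's actual algorithm, which this document does not describe. The function names read_range and read_with_threads are hypothetical, and the sketch assumes a newline-delimited UTF-8 file on disk.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from typing import List


def read_range(path: str, start: int, end: int) -> List[str]:
    """Read the rows whose first byte lies in the range [start, end)."""
    rows = []
    with open(path, "rb") as f:
        if start > 0:
            # Step back one byte and finish that row, so the reader starts
            # at the first row that begins at or after `start`.
            f.seek(start - 1)
            f.readline()
        while f.tell() < end:
            line = f.readline()
            if not line:
                break  # end of file
            rows.append(line.rstrip(b"\r\n").decode("utf-8"))
    return rows


def read_with_threads(path: str, threads: int) -> List[List[str]]:
    """Split the file into `threads` byte ranges and read them concurrently."""
    size = os.path.getsize(path)
    bounds = [size * i // threads for i in range(threads + 1)]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        futures = [
            pool.submit(read_range, path, bounds[i], bounds[i + 1])
            for i in range(threads)
        ]
        # Results are collected per range; the threads themselves may
        # finish in any order.
        return [future.result() for future in futures]
```

Each inner list holds the rows one thread read. A pipeline that consumes rows as the threads produce them would see rows from different ranges interleaved, which is why row order is not guaranteed with multi-threaded reading.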

Configuring for File Partitioning


After you create partition points and configure partitioning information, you can configure source connection
settings and file properties on the Transformations view of the Mapping tab. Click the source instance name
you want to configure under the Sources node. When you click the source instance name for a file source,
the Workflow Manager displays connection and file properties in the session properties. You can configure
the source file names and directories for each source partition. The Workflow Manager generates a file name
and location for each partition. The following table describes the file properties settings for file sources in a
mapping:

Configuring Sessions to Use a Single Thread


To configure a session to read a file with a single thread, pass empty data through partitions 2-n. To pass
empty data, create a file with no data, such as empty.txt, and put it in the source file directory. Then, use
empty.txt as the source file name.
Note: You cannot configure single-threaded reading for partitioned sources that use a command to generate
source data.
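
The empty-file approach can be sketched as a small helper that prepares the placeholder file and lists the source file name for each partition. This is an illustration under stated assumptions, not a Workflow Manager feature: the function name single_thread_file_names and its parameters are hypothetical, and the values it returns would still be entered in the session properties by hand.

```python
from pathlib import Path
from typing import List


def single_thread_file_names(
    source_name: str,
    partitions: int,
    source_dir: str,
    placeholder: str = "empty.txt",
) -> List[str]:
    """Return one source file name per partition for single-threaded reading."""
    if partitions < 1:
        raise ValueError("a session needs at least one partition")
    # Create (or truncate) the zero-byte placeholder in the source file directory.
    Path(source_dir, placeholder).write_bytes(b"")
    # Partition 1 reads the real file; partitions 2-n receive empty data.
    return [source_name] + [placeholder] * (partitions - 1)
```

With three partitions and ProductsA.txt as the source, the helper returns ProductsA.txt for partition 1 and empty.txt for partitions 2 and 3.
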
The following table shows the source file name and values when the Integration Service creates one thread
to read ProductsA.txt. It reads rows in the file sequentially. After it reads the file, it passes the data to three
partitions in the transformation pipeline:

The following table shows the source file name and values when the Integration Service creates two threads.
It creates one thread to read ProductsA.txt, and it creates one thread to read ProductsB.txt. It reads the
files concurrently, and it reads rows in the files sequentially:

If you use FTP to access source files, you can choose a different connection for each direct file.

Configuring Sessions to Use Multiple Threads


To configure a session to read a file with multiple threads, leave the source file name blank for partitions
2-n. The Integration Service uses partitions 2-n to read a portion of the previous partition file or file list. The
Integration Service ignores the directory field of that partition.
To configure a session to read from a command with multiple threads, enter a command for each partition
or leave the command property blank for partitions 2-n. If you enter a command for each partition, the
Integration Service creates a thread to read the data generated by each command. Otherwise, the
Integration Service uses partitions 2-n to read a portion of the data generated by the command for the first
partition.
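
The blank-name rule can be modeled in a few lines: a partition with a blank source file name (or blank command) shares the source of the nearest earlier partition that names one. The sketch below is a reading of the rule as stated above, not Integration Service code. The function name threads_per_source is hypothetical, and the partition layouts in the usage note are assumed examples.

```python
from typing import Dict, List


def threads_per_source(partition_values: List[str]) -> Dict[str, int]:
    """Count how many reader threads each file or command ends up with.

    `partition_values` holds the source file name (or command) entered for
    partitions 1..n; an empty string means the property was left blank.
    """
    if not partition_values or not partition_values[0].strip():
        raise ValueError("partition 1 must name a source file or command")
    counts: Dict[str, int] = {}
    current = ""
    for value in partition_values:
        value = value.strip()
        if value:
            current = value  # this partition starts reading a new source
        # A blank partition reads a portion of the previous partition's source.
        counts[current] = counts.get(current, 0) + 1
    return counts
```

Under this model, threads_per_source(["ProductsA.txt", "", ""]) gives ProductsA.txt three threads, and threads_per_source(["ProductsA.txt", "", "ProductsB.txt"]) gives ProductsA.txt two threads and ProductsB.txt one. The same reading applies to commands such as CommandA and CommandB.
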
The following table shows the attributes and values when the Integration Service creates three threads to
concurrently read ProductsA.txt:

The following table shows the attributes and values when the Integration Service creates three threads to
read ProductsA.txt and ProductsB.txt concurrently. Two threads read ProductsA.txt and one thread reads
ProductsB.txt:

The following table shows the attributes and values when the Integration Service creates three threads to
concurrently read data piped from the command:

The following table shows the attributes and values when the Integration Service creates three threads to
read data piped from CommandA and CommandB. Two threads read the data piped from CommandA and
one thread reads the data piped from CommandB:

Configuring Concurrent Read Partitioning


By default, the Integration Service does not preserve the row order when multiple partitions read from a
single file source. To preserve row order when multiple partitions read from a single file source, configure
concurrent read partitioning. You can configure the following options:

Optimize throughput. The Integration Service does not preserve the row order when multiple
partitions read from a single file source. Use this option if the order in which multiple partitions read
from a file source is not important.

Keep the relative input row order. Preserves the sort order of the input rows read by each
partition. The following table shows an example of the rows that each of two partitions reads from a
file source with 10 rows:

Partition       Rows Read
Partition #1    1,3,5,8,9
Partition #2    2,4,6,7,10

Keep absolute input row order. Preserves the sort order of all input rows read by all partitions.
Use this option if you want to preserve the sort order of the input rows each time the session runs.
In a pass-through mapping with passive transformations, the rows are written to the target in the
same order as the input rows.
The following table shows an example of the rows that each of two partitions reads from a file source
with 10 rows:

Partition       Rows Read
Partition #1    1,2,3,4,5
Partition #2    6,7,8,9,10

Note: By default, the Integration Service uses the Keep absolute input row order option in sessions
configured with the resume from the last checkpoint recovery strategy.
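
The difference between the two order-preserving options can be checked against the example row assignments. The sketch below is one interpretation of those examples, not a description of how the Integration Service enforces the options. The function names are hypothetical, and row numbers stand for positions in the source file.

```python
from typing import List


def keeps_relative_order(partitions: List[List[int]]) -> bool:
    """True if every partition reads its own rows in ascending file order."""
    return all(rows == sorted(rows) for rows in partitions)


def keeps_absolute_order(partitions: List[List[int]]) -> bool:
    """True if the partitions, taken one after another, reproduce file order."""
    flat = [row for rows in partitions for row in rows]
    return flat == sorted(flat)


relative_example = [[1, 3, 5, 8, 9], [2, 4, 6, 7, 10]]
absolute_example = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]
```

The relative example passes only the first check, because each partition is sorted but the rows are interleaved across partitions. The absolute example passes both checks.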
