
Loading a large volume of Master Data Management data quickly, Part 2: Using Rapid Deployment Package direct load with InfoSphere MDM Server
Skill Level: Intermediate

Paul A. Flores (paflores@us.ibm.com)
Senior Software Engineer
IBM

Neeraj R Singh (sneeraj@in.ibm.com)
Advisory Software Engineer
IBM

Yongli An (yongli@torolab.ibm.com)
MDM Performance Manager
IBM

01 Jul 2010

The Rapid Deployment Package (RDP) for IBM® InfoSphere™ Master Data Management (MDM) Server addresses the needs of clients in the first phase of implementing an MDM solution, when initial data must be loaded. With MDM Server, clients need to perform initial and delta loads, typically as a batch. This article focuses on the RDP Direct Load approach, using Information Server 8.0.1 to perform the initial data load for an MDM solution running InfoSphere MDM Server Version 8.0.1. In addition to an introduction and the installation and setup steps for this approach, the article provides performance tuning tips and best practices. You can use the recommendations in this article as guidance while developing your own MDM Server initial load solutions using the RDP Direct Load approach.

Introduction
The Rapid Deployment Package (RDP) for IBM InfoSphere Master Data Management (MDM) Server software is a set of assets designed to implement a rapid deployment services approach to the initial loading of data into the MDM
Server software repository. The IBM Services organization uses these assets in
applicable projects to address the needs of clients in the first phase of implementing
Master Data Management solutions. At this stage, clients typically implement
InfoSphere MDM Server software in a registry or coexistence architectural style.
Data is loaded into the MDM Server software repository mostly as data changes
from existing legacy systems. With MDM Server software, the client performs initial
and delta loads, typically in a batch. Initial load is the original movement of data from
source systems into the MDM Server software repository when the repository is
empty. Delta loads, all data loading after the initial load, are regular (such as daily)
data updates from source systems into InfoSphere MDM Server software.

There are two different approaches to loading data into InfoSphere MDM Server in batch. The maintenance service batch approach loads data into InfoSphere MDM Server using the maintenance services invoked by the MDM Server Batch Processor. The high-volume Direct Load approach uses DataStage® and QualityStage™ jobs.

This article shares an IBM team's experience performing case studies focusing on
the RDP Direct Load approach using IBM Information Server (IIS) Version 8.0.1 to
perform the initial data load for an MDM server solution running InfoSphere MDM
Server Version 8.0.1.

The article starts with an introduction to the RDP Direct Load approach for an MDM
Server solution. Then it covers the basic configuration settings and key parameters
that provide optimal performance in the RDP DataStage project. In particular, as part
of the performance tuning tips and best practices, the article explains how to
optimize performance by increasing the concurrency among the DataStage jobs.
The article concludes with a high-level summary of key performance results based
on internal case studies.

After reading the article, you should be able to use the IBM team's experience and
recommendations as guidance when deploying your own InfoSphere MDM Server
initial load solutions that use the RDP Direct Load approach.

Getting started
InfoSphere MDM Server is an enterprise application that helps companies gain
control of business information by enabling them to manage and maintain a
complete and accurate view of their master data. MDM Server provides a unified
operational view of customers, accounts, and products, and it provides an
environment that processes updates to and from multiple channels. It aligns these
front-office systems with multiple back-office systems in real time. This provides
companies with a single source of truth for their master data.


During the first phase of implementing an MDM Server solution, clients need to
perform initial and delta loads, typically as a batch. This article introduces one of the
two initial data loading options available for InfoSphere MDM Server Database as
part of the Rapid Deployment approach. The two high-performing data loading
approaches to load large volumes of data into the target MDM Server database are:

• RDP MDM Server Maintenance Service Batch — With this approach, you
load data into MDM Server using the maintenance services invoked by
MDM server Batch Processor. This approach is described in detail in Part
1 of this article series.
• RDP MDM Server Direct Load — With this approach, you leverage the new options available from IBM InfoSphere Information Server, including the direct load capability. This is the approach described in this article, which is Part 2 of the series.
As stated above, this article focuses on the RDP Direct Load approach. The article
introduces the RDP Direct Load approach, and then covers the basic steps on how
to set up the MDM Server environment for initial data load with high performance.
The article also describes a set of performance test results that were obtained using
RDP DataStage assets utilizing the optional DB2® EE configuration in an IIS 8.0.1
implementation.

Introducing the RDP Direct Load approach


With the RDP Direct Load approach, you utilize a sophisticated DataStage and
QualityStage implementation known as the RDP Direct Load Asset to load data into
the MDM server. RDP Direct Load Asset provides validation, standardization, and
duplicate suspect processing capabilities that are governed by specific configuration
settings.

The remainder of this article assumes that you have already completed the RDP
Direct Load Asset installation steps that are listed below. Following each of these
prerequisite steps is a pointer to relevant reference information. You can find links to
the product documentation in the Resources section of this article.

• Installation of DataStage and QualityStage — refer to Information Server Documentation: Planning, Installation, and Configuration Guides.
• Installation of related patches — refer to the installation instructions delivered with each patch.
• Creation of a DataStage Project — refer to WebSphere® DataStage and QualityStage Administration Documentation: Administrator Client Guide.
• Establishing and setting project environment variables — refer to WebSphere DataStage and QualityStage Administration Documentation: Administrator Client Guide. Specific steps are outlined in the README file included with the distribution of the MDM RDP DataStage assets.
• Importing RDP DataStage Project DataStage Exchange (dsx) file — refer to WebSphere DataStage and QualityStage Development Documentation: Designer Client Guide.
• Configuration of the RDP DataStage Project — specific steps are outlined in the README file included with the distribution of the MDM RDP DataStage assets.
• Compilation of DataStage jobs — refer to WebSphere DataStage and QualityStage Development Documentation: Designer Client Guide.
• Installation verification — specific steps are outlined in the README file included with the distribution of the MDM RDP DataStage assets.
What is the Direct Load approach?

At a high level, the Direct Load approach means that you use the RDP Direct Load
Asset to read incoming data files and validate data elements. Then, depending on
configuration parameters, incoming data can be standardized and matched, and
suspected duplicates can be identified and collapsed. Furthermore, depending again
on a configuration parameter, the data can be written to the MDM database using
either of two methods (ODBC insert or Bulk load). You can use the Direct Load approach to perform both the initial loading of the database (referred to as the Initial) and the processing of updates (referred to as the Delta).

It is important to note that incoming data for the RDP Direct Load Asset must conform to the MDM Server Standard Interface Format (SIF). Each of the possible incoming data streams has specific Record Type and Sub Type designations that are used to identify each incoming record. The SIF record format also contains delta-specific columns that assist in setting the indicator values of existing database records to NULL; using these columns ensures that the related column values are not inadvertently set to NULL.
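
To make this routing concrete, the following Python sketch reads a delimited file and groups each row by its Record Type and Sub Type fields. The column positions, the pipe delimiter, and the file name are hypothetical placeholders; the actual SIF layouts are defined by the RDP Direct Load Asset.

Listing 1. Grouping SIF records by Record Type and Sub Type (illustrative sketch)

import csv
from collections import defaultdict

# Hypothetical column positions; the real SIF layouts are defined by the
# RDP Direct Load Asset, not by this sketch.
RECORD_TYPE_COL = 0
SUB_TYPE_COL = 1

def split_by_record_type(sif_path, delimiter="|"):
    """Group incoming SIF rows by (Record Type, Sub Type) so that each
    stream can be handed to its own downstream process."""
    streams = defaultdict(list)
    with open(sif_path, newline="") as handle:
        for row in csv.reader(handle, delimiter=delimiter):
            if not row:
                continue  # skip blank lines
            streams[(row[RECORD_TYPE_COL], row[SUB_TYPE_COL])].append(row)
    return streams

if __name__ == "__main__":
    # "party.sif" is an illustrative file name.
    for (rec_type, sub_type), rows in split_by_record_type("party.sif").items():
        print(rec_type, sub_type, len(rows), "records")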

The Direct Load process flow

The Direct Load process flows through the following five core processes, or phases:

• Import — processes the incoming SIF files and performs record- and field-specific processing.
• Validation — processes the imported SIF data and performs specific business validations on the incoming data, including relational integrity checks.
• Error consolidation — accumulates all errors. If the number of errors exceeds a threshold specified by a specific parameter, processing is stopped (the sketch after Figure 1 illustrates this check).
• ID assignment — ensures that the record identifiers are properly constructed and inter-related.
• Data loading — a specific parameter setting determines whether this phase inserts records using ODBC or Bulk Load. Data loading is, for the purposes of this analysis, dependent on DB2, which is the default MDM database platform. MDM supports other database platforms, but they are outside the scope of this article.

The Direct Load process also uses three optional processes, or phases, that require specific parameter settings in order to be engaged:

• Standardize
• Match
• Suspect
Figure 1 provides a diagram of the Direct Load process flow.

Figure 1. Direct Load process flow
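
As a conceptual companion to Figure 1, the Python sketch below chains the phases in order and stops once the accumulated error count passes a threshold, mirroring the error consolidation behavior described above. The phase functions and the threshold value are placeholders rather than the asset's actual jobs or parameter names.

Listing 2. Conceptual phase orchestration with an error threshold (illustrative sketch)

from typing import Callable, List, Tuple

ERROR_THRESHOLD = 100  # placeholder; the real value comes from an RDP parameter

Phase = Callable[[List[dict]], Tuple[List[dict], int]]

def run_direct_load(records: List[dict], phases: List[Tuple[str, Phase]]) -> List[dict]:
    """Run each phase in order. Each phase returns (surviving records, error count);
    processing stops once the accumulated error count exceeds the threshold."""
    total_errors = 0
    for name, phase in phases:
        records, errors = phase(records)
        total_errors += errors
        print(f"{name}: {len(records)} records, {errors} errors")
        if total_errors > ERROR_THRESHOLD:
            raise RuntimeError(f"Error threshold exceeded after {name}")
    return records

if __name__ == "__main__":
    passthrough: Phase = lambda recs: (recs, 0)  # trivial stand-in phase
    run_direct_load([{"id": 1}], [("import", passthrough), ("validation", passthrough),
                                  ("id assignment", passthrough), ("data loading", passthrough)])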

Optimizing performance
This section covers specific topics about the RDP Direct Load approach related to
performance tuning.


The RDP Direct Load Asset makes use of the parallel processing capability provided by DataStage. Parallel processing is sometimes confused with concurrent processing, and the notion of concurrent parallel jobs can add even more confusion. Therefore, the following high-level descriptions of these concepts are included to provide some clarity.

Parallel processing has two aspects: data and process. From a data perspective,
parallel processing is achieved by segmenting data into separate partitions. Then,
each partition of data is run through a copy of the same process at the same time.
The number of data partitions is equal to the number of process copies, and is
referred to as the degree of parallelism. In the context of DataStage, parallelism is a
job characteristic and is found within the context of a job.

Concurrent processing is, in effect, the spawning of processes, which is a fairly well-known and well-understood concept. Concurrency is measured in terms of the number of jobs that are run at the same time. In the context of DataStage, concurrency is external to the definition of the job.

Keeping these definitions in mind, note the following:

• A job that runs with one degree of parallelism is a sequential job.
• A job that runs with a degree of parallelism greater than one is a parallel job.
• Jobs of any degree of parallelism can be run sequentially or concurrently.

In summary, parallel processing, in the context of DataStage, is an internal
characteristic of a job that is measured by the number of data or processing
partitions that are utilized. Concurrency, in the context of DataStage, is the number
of jobs that are run at the same time.
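
The distinction can be illustrated with a small Python analogy (DataStage implements partitioned parallelism inside its parallel engine, so this is only an analogy, not how the engine works internally). Parallelism splits one job's data across several copies of the same logic; concurrency starts several distinct jobs at the same time.

Listing 3. Parallelism versus concurrency, as a Python analogy

from concurrent.futures import ThreadPoolExecutor
from multiprocessing import Pool

def transform(partition):
    """Stand-in for one copy of a single job's processing logic."""
    return [record.upper() for record in partition]

def run_parallel_job(data, degree=4):
    """Parallelism: one job, its data split into 'degree' partitions, each
    partition processed by a copy of the same logic at the same time."""
    partitions = [data[i::degree] for i in range(degree)]
    with Pool(processes=degree) as pool:
        return pool.map(transform, partitions)

def run_jobs_concurrently(jobs):
    """Concurrency: several distinct jobs started at the same time."""
    with ThreadPoolExecutor(max_workers=len(jobs)) as executor:
        futures = [executor.submit(job) for job in jobs]
        return [future.result() for future in futures]

if __name__ == "__main__":
    print(run_parallel_job(["a", "b", "c", "d", "e", "f"], degree=3))
    print(run_jobs_concurrently([lambda: "address load done",
                                 lambda: "person load done"]))

In DataStage terms, the degree in run_parallel_job roughly corresponds to the number of nodes defined in the parallel engine configuration file, while run_jobs_concurrently corresponds to a job sequence that triggers several jobs at once.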

Configuring the RDP Direct Load asset

This section covers the RDP DataStage parameter sets and their settings that
influence execution behavior and performance. The setting of DataStage
environment variables and project parameters is not included here (refer to the
README file that is included with the distribution of the MDM Server RDP
DataStage assets for details on these topics).

The RDP DataStage project utilizes parameter value sets. The WebSphere
DataStage and QualityStage Designer Client Guide provides a complete description
of how to use DataStage parameter sets (refer to the WebSphere DataStage and
QualityStage documentation link in the Resources section).

The RDP DataStage project utilizes four parameter sets, each of which has a different purpose:

• The MDM_EC parameter set consists of parameters that are used for
error control and reporting. For the most part, this parameter set is
shipped pre-set.
• The MDM_CONNECTIONS parameter set consists of parameters that
are specific to database connectivity.
• The MDM_COMPILATION_OPTIONS parameter set is used to "buffer"
differences that can influence execution. These differences can arise
between operating systems, databases, and DataStage.
• The MDMIS parameter set is the main parameter set. It contains the parameters that have the largest influence on execution behavior and performance. It is important to note that the MDMIS parameter set is related to the MDM Server CONFIGELEMENT table. The parameter set help text is used to populate and update selected records in the CONFIGELEMENT table (for details, refer to the README file that is included with the distribution of the RDP Direct Load Asset for MDM Server). The individual parameters in the MDMIS parameter set, which the sketch after this list also sanity-checks, are:
• DS_LOAD_MODE — This parameter can be set to either INITIAL or DELTA. When set to INITIAL, execution of the DELTA-related processes is suppressed. An incorrect setting for this parameter (for example, setting the initial load of an empty database to DELTA) can result in errors that cause execution to abort.
• LOAD_METHOD — Determines the method used to load records into the MDM Server database. It is set to either ODBC_INSERT or BULK. The ODBC_INSERT setting is the default. The BULK setting requires that the DataStage project is correctly set up to use the DB2 Enterprise Stage. (For more information, use the WebSphere DataStage and QualityStage documentation link in the Resources section to go to the WebSphere DataStage and QualityStage Designer Client Guide.)
• QS_STAN_ADDRESS, QS_STAN_ORG_NAME, and QS_STAN_PERSON_NAME — These parameters invoke the standardization of address, organization name, and person name, respectively.
• QS_PERFORM_ORG_MATCH, QS_PERFORM_PERSON_MATCH, QS_ALLOW_LOB_MATCH, and QS_GEN_IMPLIED_MATCH — These parameters invoke Match and Suspect processing. Suspect processing is always invoked along with Match processing. These parameters, prefixed by QS_, refer to QualityStage-specific functionality that is not covered in this article.
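
The following Python sketch shows one way to sanity-check an MDMIS-style parameter dictionary before starting a load. The allowed values come from the descriptions above; the assumption that the QS_* switches take Y or N values, and the dictionary representation itself, are illustrative rather than the asset's actual format.

Listing 4. Sanity-checking MDMIS-style parameter values (illustrative sketch)

ALLOWED_VALUES = {
    "DS_LOAD_MODE": {"INITIAL", "DELTA"},
    "LOAD_METHOD": {"ODBC_INSERT", "BULK"},
}

# QS_* switches discussed above; treating them as Y/N flags is an assumption.
QS_FLAGS = ["QS_STAN_ADDRESS", "QS_STAN_ORG_NAME", "QS_STAN_PERSON_NAME",
            "QS_PERFORM_ORG_MATCH", "QS_PERFORM_PERSON_MATCH",
            "QS_ALLOW_LOB_MATCH", "QS_GEN_IMPLIED_MATCH"]

def validate_mdmis(params):
    """Return a list of problems found in an MDMIS-style parameter dictionary."""
    problems = []
    for name, allowed in ALLOWED_VALUES.items():
        if params.get(name) not in allowed:
            problems.append(f"{name} must be one of {sorted(allowed)}")
    for flag in QS_FLAGS:
        if params.get(flag, "N") not in {"Y", "N"}:
            problems.append(f"{flag} must be Y or N")
    return problems

if __name__ == "__main__":
    settings = {"DS_LOAD_MODE": "INITIAL", "LOAD_METHOD": "BULK"}
    print(validate_mdmis(settings) or "parameter values look consistent")
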
RDP Direct Load Asset configuration parameters

The degree of parallelism (the number of data/processing partitions) utilized by DataStage jobs is set through the use of a configuration file. The RDP Direct Load Asset is shipped with the following three default configuration files:

• MDM_1X1.apt is used for sequential processing.
• MDM_Default.apt is used for parallel processing.
• remoteDB2.apt provides an example of a remote DB2 configuration.

A configuration file defines the nodes where processing and disk space are allocated for use in a parallel job. (For more information, use the WebSphere DataStage and QualityStage documentation link in the Resources section to go to the Parallel Job Developer's Guide, Chapter 12, "The Parallel engine configuration file.")
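
A parallel-engine configuration file of this kind lists logical nodes, each with a fastname, pools, and disk and scratch-disk resources. The sketch below writes a minimal multi-node file in that general shape; the host name and directory paths are placeholders, and the MDM_Default.apt file shipped with the asset should be treated as the authoritative template for your environment.

Listing 5. Generating a minimal parallel-engine configuration file (illustrative sketch)

NODE_TEMPLATE = """  node "{name}" {{
    fastname "{host}"
    pools ""
    resource disk "{disk}" {{pools ""}}
    resource scratchdisk "{scratch}" {{pools ""}}
  }}"""

def write_apt_config(path, host, node_count=2,
                     disk_root="/ds/data", scratch_root="/ds/scratch"):
    """Write a configuration file with node_count logical nodes on one host.
    The paths and host name are placeholders to adapt to your installation."""
    nodes = [NODE_TEMPLATE.format(name=f"node{i}", host=host,
                                  disk=f"{disk_root}{i}",
                                  scratch=f"{scratch_root}{i}")
             for i in range(1, node_count + 1)]
    with open(path, "w") as handle:
        handle.write("{\n" + "\n".join(nodes) + "\n}\n")

if __name__ == "__main__":
    # Four nodes roughly corresponds to a degree of parallelism of four.
    write_apt_config("MDM_4X1.apt", host="iisserver", node_count=4)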

There are a large number of factors that can affect performance. These factors
include the number of available disk controllers, the number of processors, the
amount of available RAM, the amount of available disk space, disk fragmentation,
and configuration of virtual memory. Furthermore, the inter-relationships of all these
factors to one another also affects performance. Understandably, the performance
tuning of a DataStage environment can sometimes appear to be more art than
science.

Concurrency

As described previously, concurrency refers to the number of jobs that are run at the same time. Running multiple jobs concurrently is an effective strategy to increase overall throughput. However, care must be taken to ensure that concurrent jobs do not attempt to use the same data sources; this helps avoid resource contention that can severely degrade performance or lead to a record deadlock condition. Consideration must also be given to the number of active, concurrent database connections, again to avoid potential resource contention.

A close review reveals that increasing concurrency is not applicable to all phases of the RDP Direct Load Asset. It is technically feasible to increase concurrency during the Import phase, but it is not practical. Party Validation jobs already run in a somewhat concurrent manner that takes into account the database resources they share. Increasing concurrency in the Validation jobs would require extensive job customization to further reduce the use of shared database resources. For this reason, the RDP Direct Load approach described in this article does not increase concurrency for the Validation jobs; it stays as close as reasonably possible to what is shipped.

The functionality related to Standardization is embedded in the existing Validation jobs, so it is not considered a candidate for increased processing concurrency. Jobs related to ID Assignment and Match/Suspect processing require sequential processing, not so much to avert resource contention as because of their dependency on each other.

What is left are the jobs related to Data Loading. Because these jobs do not attempt to insert into the same database tables, the only consideration to keep in mind is the number of concurrent database connections. Note that database relational integrity checks are disabled during the load process.

Concurrency case study

This section provides details about specific changes made to increase the
concurrency of the data loading phase of the RDP Direct Load Asset. All the figures
in this section are screenshots from the DataStage Designer client graphical user
interface of the applicable RDP DataStage assets. In these screenshots, the
connector lines between the job objects (green and yellow objects) and
decision/control objects (objects with question marks and blue objects) indicate the
flow, or order, in which jobs are executed.

The performance tuning efforts taken for this case study focused on reducing the
overall time to load Party data into the MDM Server database. As a result, focus was
placed on the RDP DataStage job sequences: DL_000_DELTA_LOAD,
DL_090_LD__Insert_Party, and DL_091__LD_Bulk_Party. DL_000_DELTA_LOAD
controls the invocation of both DL_090_LD__Update_Party_SQL and
DL_090_LD__Insert_Party. The DL_090_LD__Insert_Party job sequence manages
the loading of data either through ODBC insertion or Bulk Load. It consists of
DL_090_LD__Insert_Party and DL_091__LD_Bulk_Party.
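
One way to drive this kind of concurrency from outside the sequences is to start several DataStage jobs at once from a controlling script. The sketch below launches two jobs in parallel threads through the dsjob command-line client. The project name is a placeholder, and the exact dsjob options vary by Information Server release, so verify the command line against your installation before relying on it.

Listing 6. Starting DataStage jobs concurrently from a controlling script (illustrative sketch)

import subprocess
from concurrent.futures import ThreadPoolExecutor

PROJECT = "MDM_RDP"  # placeholder DataStage project name

def run_job(job_name):
    """Start one DataStage job and wait for it to finish.
    The -run/-wait/-jobstatus options are assumptions; check dsjob help output."""
    command = ["dsjob", "-run", "-wait", "-jobstatus", PROJECT, job_name]
    return subprocess.call(command)

if __name__ == "__main__":
    jobs = ["DL_090_LD__Update_Party_SQL", "DL_090_LD__Insert_Party_SQL"]
    with ThreadPoolExecutor(max_workers=len(jobs)) as executor:
        results = list(executor.map(run_job, jobs))
    print(dict(zip(jobs, results)))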

As shown in Figure 2, by default the DL_000_DELTA_LOAD job sequence executes DL_090_LD__Update_Party_SQL and DL_090_LD__Insert_Party_SQL jobs sequentially.

Figure 2. Sequential Delta Load


Figure 3 shows DL_000_DELTA_LOAD after increasing its concurrency. The DL_000_DELTA_LOAD job sequence now runs the DL_090_LD__Update_Party_SQL and DL_090_LD__Insert_Party_SQL jobs concurrently.

Figure 3. Concurrent Delta Load


As shown in Figure 4, by default the DL_090_LD__Insert_Party job sequence executes jobs in a sequential manner.

Figure 4. Sequential Insert Flow


Figure 5 shows DL_090_LD__Insert_Party after increasing the number of concurrent jobs. In this diagram, all the job objects without connectors are invoked at the same time and run concurrently. The 13 oval call-outs on the diagram provide a count of the number of concurrent processes. The constraining factors for performance are the number of concurrent database connections and available memory (both RAM and disk) in relation to incoming data volume.

Figure 5. Concurrent Insert Flow

As shown in Figure 6, by default the DL_091__LD_Bulk_Party job sequence executes jobs in a sequential manner.


Figure 6. Sequential Bulk Load

Figure 7 shows DL_091__LD_Bulk_Party after increasing the number of concurrent bulk loading jobs. It is important to note that bulk loading utilizes a multi-instance job construct, and by rule, a multi-instance job cannot run concurrent versions of itself. To address this constraint, copies of the multi-instance job were made and configured to reflect the particular job being changed from sequential to concurrent processing. The six square call-outs in the diagram provide a count of the number of concurrent bulk loading jobs.

Figure 7. Concurrent Bulk Load


Database Tuning (DB2)


In general, it is very important to follow best practices and recommendations when setting up the database server. You should also closely monitor your database performance, and tune your database as needed for optimal performance and productive resource usage.

This section briefly describes several recommendations for configuring and tuning a
DB2 database. The basic concepts should apply to other types of databases too.

Typically, it's recommended that you use a set of dedicated disks for DB2
transaction logs and use another set of dedicated disks for DB2 table spaces. If
possible, it's even better if you are able to use different disk controllers for DB2
transaction logs and DB2 table spaces. Using different disk controllers gives you the
flexibility to configure the disk controllers independently for different I/O patterns that
favor writes or a mix of writes and reads. Ensure read and write cache is enabled on
your storage system. Monitor the cache effectiveness and, based on cache usage,
configure the cache size accordingly.

Your overall database performance is limited by the bandwidth of your busiest disks.
Therefore, properly plan your table spaces to ensure balanced I/O operations across
all of your available disks. This helps to avoid hot spots on your database. Optimally,
you want to maximize the utilization of all the I/O bandwidth available from all
physical disks.

Another configuration parameter that dramatically affects performance is the database buffer pool size. Pay close attention to your overall buffer pool hit ratio. This tells you how often the physical disks needed to be accessed for data that was not found in the database buffer pool(s). Accessing the physical disks is expensive in terms of performance.

A desirable goal for your buffer pool hit ratio is 80% or higher for data, and 90% or higher for index. Typically, in an MDM Server implementation, if you start with one big buffer pool for both data and index, you should be fine. If necessary, you can separate data and index into two different buffer pools to help ensure a good index buffer pool hit ratio. Of course, this also requires that you create separate table spaces for data and indexes.
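
The hit ratios can be computed from the logical and physical read counters in a DB2 buffer pool snapshot. The short Python sketch below shows the arithmetic, using the standard monitor element names as dictionary keys; the counter values shown are illustrative only.

Listing 7. Computing buffer pool hit ratios from snapshot counters (illustrative sketch)

def hit_ratio(logical_reads, physical_reads):
    """Percentage of page requests satisfied from the buffer pool."""
    if logical_reads == 0:
        return 0.0
    return 100.0 * (1 - physical_reads / logical_reads)

def bufferpool_ratios(snapshot):
    """Compute data and index hit ratios from buffer pool snapshot counters.
    Keys follow the DB2 monitor element names (pool_data_l_reads, and so on)."""
    return {
        "data": hit_ratio(snapshot["pool_data_l_reads"],
                          snapshot["pool_data_p_reads"]),
        "index": hit_ratio(snapshot["pool_index_l_reads"],
                           snapshot["pool_index_p_reads"]),
    }

if __name__ == "__main__":
    sample = {"pool_data_l_reads": 1000000, "pool_data_p_reads": 150000,
              "pool_index_l_reads": 800000, "pool_index_p_reads": 40000}
    print(bufferpool_ratios(sample))  # data: 85.0 (>= 80), index: 95.0 (>= 90)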

An MDM Server solution typically includes some customization and extension. Use database snapshots or other tools to analyze the SQL statements that have the biggest negative impact on performance. Ensure that those statements have optimal access plans and the best indexes in place.

When implementing the above recommendations, consider their costs and benefits together as a whole in order to achieve your performance goals. The behavior of one area might be just the symptom of another incorrectly configured or misbehaving area.


Understanding software and hardware requirements


Following is a typical system topology for an MDM server deployment that uses the
RDP Direct Load approach for initial or delta loading:

• The application server and InfoSphere MDM Server are installed on one physical box or LPAR with the right CPU capacity (Server1). The number of CPUs depends on the overall throughput requirements. Note that this server does not participate in the load process when the Direct Load approach is used. However, it is used by data stewards to manage the un-collapsed suspects with the MDM Server DataStewardship user interface, which, in turn, submits transactions to the MDM Server for processing.
• The database server is installed on another physical box or LPAR
(Server2) with well-equipped I/O capacity.
• The IIS server should be installed either on Server2 with the database
server, or on a third physical box or LPAR (Server3) that has good I/O
bandwidth.
• The IIS Client is used to configure QS jobs. It is installed on a Windows®
machine.
To make the most of a given configuration, follow these sizing rules (a small sizing sketch follows the list):

• The ratio of the number of CPUs on the MDM server and the database
server can range from 2:1 to 3:1. For example, if you have a database
server with 4 CPUs, the recommended number of CPUs on the MDM
server box is at least 8. This ensures that the CPU capacity on the
database server is well utilized.
• It's recommended that you have 5-10 physical disk spindles available to
the database server for each CPU on the database server.
• The ratio of the number of CPUs on the MDM server and the IIS server
can range from 2:1 to 1:1. For example, if you have an MDM server with 8
CPUs, the recommended number of CPUs on the IIS server box is 4-8.
• The I/O capacity on the IIS box should be at least half of the database
server's I/O capacity.
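
The ratios above can be turned into a small sizing calculator, shown below. It is only arithmetic over the published guidelines (2:1 to 3:1 MDM-to-database CPUs, 2:1 to 1:1 MDM-to-IIS CPUs, 5 to 10 spindles per database CPU, and at least half of the database server's I/O capacity on the IIS box) and is not a substitute for a real capacity-planning exercise.

Listing 8. Rough server sizing from the database CPU count (illustrative sketch)

def size_environment(db_cpus):
    """Derive rough sizing ranges from the database server CPU count,
    using the ratio guidelines described in the list above."""
    mdm_cpus = (2 * db_cpus, 3 * db_cpus)        # MDM : database from 2:1 to 3:1
    iis_cpus = (mdm_cpus[0] // 2, mdm_cpus[0])   # MDM : IIS from 2:1 to 1:1
    spindles = (5 * db_cpus, 10 * db_cpus)       # 5-10 spindles per database CPU
    return {
        "mdm_cpus": mdm_cpus,
        "iis_cpus": iis_cpus,
        "db_disk_spindles": spindles,
        "iis_io_capacity": "at least half of the database server's I/O capacity",
    }

if __name__ == "__main__":
    # Example from the text: a 4-CPU database server implies at least 8 MDM CPUs.
    print(size_environment(4))
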
Exploring the example environment

Following is a brief description of the test environment, including information about the hardware and software in each layer of the stack:


• Server1: Application server and InfoSphere MDM Server (does not play
any role during the loading process when using the Direct Load approach)
• Hardware:
• Machine Type: IBM 9116-561, PowerPC_POWER5™
• CPUs: 8 core Power5 with 16 threads, 1.5GHz, 64 bit
• Memory, I/O: 32 GB RAM, 6 internal disks
• Software:
• OS: AIX® Version 5.3 (5300-06) (64 bit)
• WebSphere Application Server ND 6.1.0.11 (32 bit)
• InfoSphere Master Data Management 8.0.1 + EntryLevelMDM
assets
• Server2: Database server
• Hardware:
• Machine Type: IBM 9116-561, PowerPC_POWER5
• CPUs: 8 core Power5 with 16 threads, 1.5GHz, 64 bit
• Memory, I/O: 32 GB RAM, 6 internal disks + 40 external disks
• Software:
• OS: AIX Version 5.3 (5300-06) (64 bit)
• DB2® database server v9.5 (64 bit)
• Server3: IIS server
• Hardware:
• Machine Type: IBM 9116-561, PowerPC_POWER5
• CPUs: 8 core Power5 with 16 threads, 1.5GHz, 64 bit
• Memory, I/O: 32 GB RAM, 6 internal disks + 40 external disks
• Software:
• OS: AIX Version 5.3 (5300-06) (64 bit)
• IIS v8.0.1
• DataStage v8.0.1
• RDP DataStage Project Release 5 Version 3


• QualityStage v8.0.1
• Server4: IIS client — needed to configure DataStage and QualityStage
jobs, not needed while running the test
• Hardware:
• 32 bit x86 machine
• Software:
• OS: Windows 2003 Server
• IIS client for Windows Version 8.0.1
Figure 8 shows the system topology used in the case study.

Figure 8. Case study system topology

Understanding the performance test methodology used in the example

Input data preparation

The example case study used input data in SIF format derived from
maintainContractPlus transactions. The first step toward getting the input data set
was to create the seed-data. The seed-data was generated using a home-grown,
Java-based tool with key distributions based on U.S. Census data from the year
2000. Some realistic data was added to make the overall party data closely match a
typical MDM business scenario. The seed-data contained details such as name,
gender, date of birth, and addresses.

As a second step, a template for the maintainContractPlus transaction request was created. This template had variables for key party details that needed to be filled in
with generated seed-data. Using this template and the seed-data generated above,
a set of SIF files with 1 million records was created. When loaded into the MDM
Server database, each of these records yielded one person with one name, one
address, one contact, and one contact method. Table 1 shows the detailed profile of
the database tables when populated with one million records.

Suspect duplicate data preparation

So far, the data generated in the example was primarily clean. However, during the
initial load, the input data might have duplicate entries, where details from one
record closely resemble those from another. These records are called suspect
duplicates. The "dirty data" for the example includes 40% suspect duplicates. It was
generated using an approach similar to the approach used to generate the clean
data. This data set was used when Suspect Duplicate Processing was turned on.
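
To give a flavor of how such a data set can be derived from clean seed records, the Python sketch below injects one near-duplicate for a configurable fraction of parties by slightly perturbing a name. It mirrors the idea described above; the actual case study used its own Java-based generation tool, and the record fields shown here are illustrative.

Listing 9. Injecting suspect duplicates into clean seed data (illustrative sketch)

import copy
import random

def add_suspect_duplicates(parties, duplicate_fraction=0.4, seed=42):
    """Return a new list in which duplicate_fraction of the parties gain one
    near-duplicate record (slightly altered given name, same other details)."""
    rng = random.Random(seed)
    dirty = list(parties)
    for party in rng.sample(parties, int(len(parties) * duplicate_fraction)):
        duplicate = copy.deepcopy(party)
        # Perturb the given name slightly so the pair becomes a suspect match.
        duplicate["given_name"] = duplicate["given_name"][:-1] or duplicate["given_name"]
        dirty.append(duplicate)
    return dirty

if __name__ == "__main__":
    clean = [{"given_name": "ALICE", "surname": "SMITH", "city": "AUSTIN"},
             {"given_name": "ROBERT", "surname": "JONES", "city": "DENVER"}]
    print(add_suspect_duplicates(clean, duplicate_fraction=0.5))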

When testing the example, the following two sets of data were used:

• 100% clean data — contains no suspect duplicates


• 60% clean data — 40% of total records have one suspect duplicate per
party
Data profile

The following table shows the population of the InfoSphere MDM Server database
tables when the two sets of input data are loaded.

Table 1. Database population

Table Name              100% Clean Data    60% Clean Data
ADDRESS                 1,000,000          700,000
ADDRESSGROUP            1,000,000          900,000
CONTACT                 1,000,000          900,000
CONTACTMETHOD           1,000,000          900,000
CONTACTMETHODGROUP      1,000,000          900,000
CONTEQUIV               1,000,000          1,000,000
CONTRACT                1,000,000          1,000,000
CONTRACTCOMPONENT       1,000,000          1,000,000
CONTRACTROLE            1,000,000          1,000,000
IDENTIFIER              1,000,000          900,000
LOBREL                  1,000,000          900,000
LOCATIONGROUP           2,000,000          1,800,000
MISCVALUE               1,000,000          1,000,000
PERSON                  1,000,000          900,000
PERSONNAME              1,000,000          900,000
PERSONSEARCH            1,000,000          900,000
SUSPECT                 0                  300,000

Test methodology

For the example, different tests were performed to check stability and to measure
the overhead associated with several commonly used features. The methodology for
each of the tests was similar:

1. Set up the systems. Configure and tune the various components as described earlier in this article.

2. Prepare a set of input data with 10,000 records using the previously described approach.

3. Load the input data with 10,000 records. This is done to avoid deadlocks while working with an empty database.

4. Perform DB2 reorgchk on all the tables to update statistics.

5. Delete the loaded data so that the initial database is empty. Back up this database. The backup is now the database to be used as the starting point for all the tests.

The following steps were used to run the example tests:

1. Restore the database using the backup copy.

2. Reset all the jobs in the RDP Direct Load Asset.

3. Run data collection scripts in the background. These scripts collect CPU statistics, I/O statistics, and database snapshots (a sketch of such a script follows these steps).

4. Start the test to load the selected input dataset.

5. Collect the logs and reports from the IIS server and the DB2 database server.
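
A background collection script of the kind mentioned in step 3 might look like the sketch below, which starts vmstat and iostat and periodically captures DB2 database snapshots. The database name, sampling interval, and command options are assumptions to adapt to your own environment (the case study systems ran AIX, where these utilities are available).

Listing 10. Background collection of CPU, I/O, and database statistics (illustrative sketch)

import subprocess
import time

DB_NAME = "MDMDB"         # placeholder database name
INTERVAL_SECONDS = 60     # sampling interval; adjust as needed
SNAPSHOT_COUNT = 10       # number of DB2 snapshots to capture

def start_os_monitors():
    """Start vmstat and iostat in the background, each writing to a log file."""
    vmstat_log = open("vmstat.log", "w")
    iostat_log = open("iostat.log", "w")
    return [subprocess.Popen(["vmstat", str(INTERVAL_SECONDS)], stdout=vmstat_log),
            subprocess.Popen(["iostat", str(INTERVAL_SECONDS)], stdout=iostat_log)]

def capture_db_snapshot(run_id):
    """Capture one DB2 database snapshot into a numbered log file."""
    with open(f"db2_snapshot_{run_id}.log", "w") as out:
        subprocess.call(["db2", "get", "snapshot", "for", "database", "on", DB_NAME],
                        stdout=out)

if __name__ == "__main__":
    monitors = start_os_monitors()
    try:
        for i in range(SNAPSHOT_COUNT):
            capture_db_snapshot(i)
            time.sleep(INTERVAL_SECONDS)
    finally:
        for monitor in monitors:
            monitor.terminate()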

Measuring performance results


This section describes the performance measurements carried out to:

• Compare the performance of Bulk mode to ODBC_Insert mode.


• Measure performance overhead of some commonly used features, in the
context of initial data loading.
When running in Bulk mode, there is no performance difference during the initial stages of processing (import, validation, and so on). However, the stages that load the data into the database are much faster. In the test environment, the overall time needed to load 1 million parties using Bulk load was only about 50% of the time needed when using ODBC_Insert mode.

For both modes, tests were conducted to measure the overhead associated with
Name and Address standardization and suspect duplicate processing. The
measurements were made with two sets of data as described in the Input data
preparation section. The test results are shown in Table 2.

Table 2. Feature overhead measurement results

Measurement                                       RDP Bulk Load   RDP ODBC_Insert
Standardization and suspect duplicate
processing overhead — 100% clean data             23%             14%
Standardization and suspect duplicate
processing overhead — 60% clean data              23%             9%
Loading history tables                            Minimal *       Minimal *

* Assuming sufficient I/O capacity on the database server.

Note that enabling standardization and suspect duplicate processing causes some
extra work to be done. However, when data is not clean, the duplicate records get
dropped and there is less data to load into the database. This reduces loading time.


The overhead measured above is the net effect of these two factors. In the case of
Bulk mode, the latter factor is insignificant because data loading is very quick. A
10% decrease in data volume does not result in any significant time saving.
However, in the case of ODBC_Insert mode, loading processes run longer and
result in a noticeable time saving. This is why there is less overhead for dirty data as
compared to clean data.

A few tests were also run to assess the impact on total load time when varying the
number of nodes. Performance did improve when the number of nodes in the test
environment was increased, but no significant improvement was noticed after the
number of nodes was increased beyond eight. However, one of the known
limitations of the test environment was that it had less than ideal I/O capacity (due to
a limited number of physical spindles). For your own implementation, you may want
to carry out a few tests to find the number of nodes that is best suited for your
environment.

Note that the performance data described above is for a specific operating
environment and for a given set of test data. Performance results and the overhead
in other operating environments with actual input data may be different than these
test results. This is expected because data loading performance depends upon
many factors, including the I/O configuration, the storage configuration, and the
nature of the workload being processed.

Conclusion
InfoSphere Master Data Management Server has been the leading choice for a large number of organizations across a range of industries when implementing Master Data Management solutions. It is designed for deployment flexibility, built on leading technology, and offers significant performance and scalability. IBM has the largest number of successfully deployed MDM implementations in the market today. This leadership position is further strengthened by the availability of two high-performing and scalable rapid deployment approaches that facilitate initial and delta data loads, helping customers get a faster return on their investment. This article described the RDP Direct Load approach using DataStage jobs, which provides a fast and easy way to load a large volume of initial MDM data.

This article described the RDP Direct Load approach for MDM Server and showed how to set up the RDP Direct Load approach for an InfoSphere MDM Server environment. It described how to configure the RDP Direct Load asset, and it explained the concepts of concurrency and parallelism, the two dimensions of optimizing data loading performance, in the context of RDP DataStage jobs. Tuning and optimization tips were provided and demonstrated through sample case studies. You should now be able to apply the methodology that the article described to get your own RDP Direct Load approach implementation up and running with high performance.

The last part of the article described some key performance data points from a
couple of common scenarios that demonstrated how the RDP Direct Load approach
provides sustainable high performance for initial data load. You might find these
scenarios useful when conducting capacity planning for an MDM Server system
based on the chosen features and for ensuring the required performance level
during initial load.

Acknowledgements
The authors would like to thank Lena Woolf, Henk Alblas, Mike Carney, Michael
Mahoney, and other MDM/RDP team members for their support and review of this
article.


Resources
Learn
• "Loading a large volume of Master Data Management data quickly Part 1: Using
RDP MDM Server maintenance services batch" (developerWorks, 2009 August)
provides information on the alternate approach for loading large volumes of
data into a target MDM Server database.
• The WebSphere Customer Center: Understanding Performance Redpaper
provides a technical context to help you plan your WCC performance goals.
• Use the IBM InfoSphere Master Data Management Server website to find out
more about this product.
• The IBM InfoSphere MDM Server Information Center provides you online
access to the product documentation.
• The IBM Information Server Information Center provides you online access to
the product documentation.
• The Master Data Management: Rapid Deployment Package for MDM Redbook
documents the procedures for implementing this solution.
• The DataStage and QualityStage Documentation webpage includes links to the latest PDF documentation for IBM® InfoSphere Information Server and its components.
• The IBM DB2 Database for Linux®, UNIX® and Windows Information Center
provides you online access to the product documentation.
• "DB2 Tuning Tips for OLTP Applications" (developerWorks, 2002 January) is a
classic developerWorks article that provides a number of DB2 tuning tips.
Discuss
• Participate in the discussion forum for this content.
• Check out the developerWorks blogs and get involved in the developerWorks
community.

About the authors


Paul A. Flores


Paul Flores joined IBM in 2008 as a DataStage Consultant in the Advanced Consulting Group, transitioning to the Software Group as a Senior Software Engineer as a result of his involvement in the development and testing of RDP since its inception. He has a wide range of design and development knowledge and experience. He holds a Bachelor's degree in Mathematics and a Master's degree in Computer Information Sciences.

Neeraj R Singh
Neeraj R Singh is currently a senior performance engineer working on Master Data Management Server performance. He previously led the Java technologies test team for functional, system, and performance testing as technical lead and test project leader. He joined IBM in 2000 and holds a Bachelor's degree in Electronics and Communications Engineering.

Yongli An
Yongli An is an experienced performance engineer focusing on Master Data Management products and solutions. He is also experienced in DB2 database server and WebSphere performance tuning and benchmarking. He is an IBM Certified Application Developer and Database Administrator - DB2 for Linux, UNIX, and Windows. He joined IBM in 1998. He holds a Bachelor's degree in Computer Science and Engineering and a Master's degree in Computer Science. Currently, Yongli is the manager of the MDM performance and benchmarks team, focusing on Master Data Management Server performance and benchmarks and helping customers achieve optimal performance for their MDM systems.
