

Siperian Hub XU

Implementer’s Guide
© 2007 Siperian, Inc.
Copyright 2007 Siperian, Inc. [Unpublished - rights reserved under the Copyright Laws of the United States]
THIS DOCUMENTATION CONTAINS CONFIDENTIAL INFORMATION AND TRADE
SECRETS OF SIPERIAN, INC. USE, DISCLOSURE OR REPRODUCTION IS PROHIBITED
WITHOUT THE PRIOR EXPRESS WRITTEN PERMISSION OF SIPERIAN, INC.
Contents
Preface
Intended Audience ............................................................................................................................................x
Contents.............................................................................................................................................................xi
Learning About Siperian Hub .......................................................................................................................xii
Contacting Siperian ........................................................................................................................................xiv

Chapter 1: Introducing Siperian Hub Implementation


Siperian Implementation Methodology..........................................................................................................2
Reducing Project Risk.............................................................................................................................2
Core Principles.........................................................................................................................................2
Roles in a Siperian Hub Implementation Project .........................................................................................4
Phases in a Siperian Hub Implementation Project ....................................................................6
Discover Phase.........................................................................................................................................6
Analyze Phase...........................................................................................................................................7
Design Phase ............................................................................................................................................7
Build Phase ...............................................................................................................................................8
Deploy Phase............................................................................................................................................8

Chapter 2: Analyzing Data


Getting Started ...................................................................................................................................................9
Defining the Flow of Data Between Siperian Hub and Source/Target Systems ..................................10
Determine Data Source Characteristics .............................................................................................10
Assemble a Statistically Representative Sample Data Set................................................................11
Consider Data Sizing.............................................................................................................................11
Consider the Relationship Between Data and Business Processes................................................12
Consider Data Cleansing and Standardization Rules .................................................................................12
Consider Trust Levels and Validation Rules ...............................................................................................12

iii
Trust Levels............................................................................................................................................ 13
Validation Rules..................................................................................................................................... 14
Consider Match Rules .................................................................................................................................... 14

Chapter 3: Designing the Data Model


About Data Modeling for MRM................................................................................................................... 18
Data Model Design Deliverables ....................................................................................................... 18
Conceptual Model................................................................................................................................. 19
Logical Model ........................................................................................................................................ 20
Physical Model....................................................................................................................................... 24
Design Principles............................................................................................................................................. 27
Principle 1: Consider Deep Versus Wide .......................................................................................... 28
Principle 2: Match Requirements Drive the Model ......................................................................... 29
Principle 3: Consolidation Counts...................................................................................................... 30
Principle 4: Pass the Independence Test .......................................................................................... 33
Principle 5: Mix Different Types of Customers Carefully.............................................................. 36
Principle 6: Landing and Staging Data............................................................................................... 40
Design Patterns ............................................................................................................................................... 42
Households ............................................................................................................................................ 42
Addresses................................................................................................................................................ 43
Populating the Address Household Object ...................................................................................... 45
Communication Channel Models....................................................................................................... 46

Chapter 4: Using Trust Settings and Validation Rules


Using Trust Levels .......................................................................................................................................... 52
About Trust Levels ............................................................................................................................... 52
How Trust Works ................................................................................................................................. 52
Ranking Source Systems According to Trustworthiness ................................................................ 55
Trust Best Practices .............................................................................................................................. 58
Configuring Trust Levels ..................................................................................................................... 60
Example Stored Procedure to Calculate Decayed Trust................................................................. 63
Using Validation.............................................................................................................................................. 65



About Validation Rules.........................................................................................................................65
How Validation Works .........................................................................................................................65
Best Practices for Validation Rules.....................................................................................................68
Using Trust and Validation Together...........................................................................................................70
Scenarios Involving Trust and Validation for a Column.................................................................70
What Happens When a Record Is Updated ......................................................................................71
Example Using Trust Levels and Validation Rules Together.........................................................72

Chapter 5: Configuring and Tuning Match Rules


About Matching ...............................................................................................................................................76
Before You Start Defining Your Match Rules..................................................................................76
Steps in the Match Process ..................................................................................................................76
Populations .......................................................................................................................................................77
Tokens for Match Keys .................................................................................................................................77
Determining When to Tokenize Your Data......................................................................................78
Match Key Widths.................................................................................................................................79
Match Key Types and Mixed Data .....................................................................................................80
Search Strategies ..............................................................................................................................................80
Match Purposes ...............................................................................................................................................81
Using the Match Purposes to Match People .....................................................................................82
Using the Match Purposes to Match Organizations ........................................................................82
Using the Match Purposes to Match Addresses ...............................................................................82
Name Formats .......................................................................................................................................82
Field Types Used in Purposes .............................................................................................................83
Match Levels ..........................................................................................................................................84
Defining and Testing Your Match Rules .....................................................................................................85
About Testing.........................................................................................................................................86
Matching Best Practices..................................................................................................................................86
Exact Match Column Properties...................................................................................................................87
Null Match..............................................................................................................................................87
Segment Match.......................................................................................................................................88
Using Matching on Dependent Tables ..............................................................................................91
Setting Match Batch Sizes ..............................................................................................................................91

Using Dynamic Match Analysis Threshold................................................................................................. 92
Tuning Match for Performance .................................................................................................................... 92
About Merging ................................................................................................................................................ 94

Chapter 6: Implementing Hierarchy Manager


About Hierarchy Manager ............................................................................................................................. 96
Before You Begin Implementing Hierarchy Manager............................................................................... 97
Defining Your Goals ............................................................................................................................ 97
Understanding the Data ....................................................................................................................... 97
Assembling the Team ........................................................................................................................... 98
Determining Resources ........................................................................................................................ 98
About Implementing a Hierarchy Manager System .................................................................................. 98
Step 1: Analyze Your Data .................................................................................................................. 99
Step 2: Build the Data Model ............................................................................................................ 102
Step 3: Configure Your Hierarchy Manager Implementation...................................................... 102
Step 4: Load Data................................................................................................................................ 102

Chapter 7: Scheduling Batch Jobs and Batch Groups


About Scheduling Siperian Hub Batch Jobs............................................................................................. 104
Setting Up Job Execution Scripts............................................................................................................... 104
Metadata in the C_REPOS_TABLE_OBJECT_V View............................................................. 104
Identifiers in C_REPOS_TABLE_OBJECT_V............................................................................ 106
Determining Available Execution Scripts ....................................................................................... 107
Retrieving Values from C_REPOS_TABLE_OBJECT_V at Execution Time ....................... 107
Running Scripts Asynchronously...................................................................................................... 108
Monitoring Job Results and Statistics ........................................................................................................ 108
Error Messages and Return Codes................................................................................................... 108
Job Execution Status .......................................................................................................................... 108
Job Scheduling Reference ............................................................................................................................ 111
Alphabetical List of Jobs.................................................................................................................... 111
Autolink Jobs ....................................................................................................................................... 112
Auto Match and Merge Jobs ............................................................................................................. 113



Automerge Jobs ...................................................................................................................................115
BVT Snapshot Jobs.............................................................................................................................116
Generate Match Token Jobs..............................................................................................................118
Key Match Jobs....................................................................................................................................120
Load Jobs ..............................................................................................................................................121
Manual Link Jobs.................................................................................................................................123
Manual Unlink Jobs.............................................................................................................................125
Match Jobs............................................................................................................................................127
Match Analyze Jobs.............................................................................................................................128
Match for Duplicate Data Jobs .........................................................................................................130
Stage Jobs..............................................................................................................................................131
Unmerge Jobs.......................................................................................................................................133
Scheduling Batch Groups.............................................................................................................................137
About Batch Groups...........................................................................................................................137
Stored Procedures for Batch Groups ...............................................................................................138
Developing Custom Stored Procedures for Batch Jobs..........................................................................145
About Custom Stored Procedures ....................................................................................................145
Required Execution Parameters for Custom Batch Jobs ..............................................................145
Example Custom Stored Procedure .................................................................................................146
Registering a Custom Stored Procedure ..........................................................................................149

Chapter 8: Implementing Custom Buttons in Hub Console Tools


About Custom Buttons in the Hub Console ............................................................................................151
How Custom Buttons Appear in the Hub Console.......................................................................152
What Happens When a User Clicks a Custom Button..................................................................154
Adding Custom Buttons...............................................................................................................................155
Writing a Custom Function ...............................................................................................................155
Controlling the Custom Button Appearance ..................................................................................159
Deploying Custom Buttons ...............................................................................................................159

Preface

Welcome to the Siperian Hub Implementer’s Guide. This guide explains how to design and
implement your Master Reference Manager (MRM) system.

This guide has been written for database administrators, system administrators, data
stewards, application developers, and other members of an MRM implementation team
who are responsible for MRM implementation and configuration tasks. To learn more,
see “Intended Audience” on page x.

You must be familiar with the platform on which Siperian Hub is installed. If that
platform is Windows, then you must also have knowledge of Microsoft Windows
Component Services, which is required for Siperian Hub™. Database administrators
must be familiar with the database environment on which they have installed MRM.
Knowledge of Oracle administration is particularly important.

Other administration and configuration tasks are described in the Siperian Hub
Administrator’s Guide and Siperian Hub User’s Guide.

This guide assumes that MRM and all supporting software components have been
installed. To learn more about installing MRM, see the Siperian Hub Installation Guide for
your platform.


Intended Audience
This guide is intended for the following audiences:

MRM Implementers
    Those responsible for designing, developing, testing, and deploying
    MRM according to the requirements of the organization. All of the
    chapters in this book are recommended for implementers.

Hierarchy Manager Implementers
    Those responsible for designing, developing, testing, and deploying
    Hierarchy Manager according to the requirements of the organization.
    See Chapter 6, “Implementing Hierarchy Manager”.

Data Stewards
    Custodians of data quality. In Siperian terms, data stewards are the
    people responsible for reviewing and, where necessary, correcting
    and manually merging business data on a regular and ongoing basis.
    While the primary resource for data stewards is the Siperian Hub
    User’s Guide, data stewards will also find the following chapters
    useful:
    • Chapter 1, “Introducing Siperian Hub Implementation”
    • Chapter 2, “Analyzing Data”
    • Chapter 3, “Designing the Data Model”
    • Chapter 4, “Using Trust Settings and Validation Rules”

Siperian Administrators
    IT people responsible for configuring or updating a Hub Store so
    that it provides the rules and functionality required by the data
    stewards. While the primary resource for administrators is the
    Siperian Hub Administrator’s Guide, administrators will also find
    the following chapters useful:
    • Chapter 5, “Configuring and Tuning Match Rules”
    • Chapter 4, “Using Trust Settings and Validation Rules”




Contents
This guide contains the following chapters:

Chapter 1, “Introducing Siperian Hub Implementation”
    Introduces the overall Siperian Hub implementation process and
    describes key concepts you need to understand before starting a
    Siperian Hub implementation project.

Chapter 2, “Analyzing Data”
    Describes activities involved with analyzing data for a Siperian Hub
    implementation project.

Chapter 3, “Designing the Data Model”
    Describes what implementers need to know before building the data
    model for a Siperian Hub implementation project.

Chapter 4, “Using Trust Settings and Validation Rules”
    Provides a brief overview of how trust settings and validation rules
    work together, best practice recommendations, and examples.

Chapter 5, “Configuring and Tuning Match Rules”
    Describes how to use and tune match rules.

Chapter 6, “Implementing Hierarchy Manager”
    Describes concepts, methodology, design patterns, and other
    information that implementers need to know before beginning a
    Hierarchy Manager™ (HM) implementation project.

Chapter 7, “Scheduling Batch Jobs and Batch Groups”
    Explains how to schedule Siperian Hub batch jobs using job execution
    scripts.

Chapter 8, “Implementing Custom Buttons in Hub Console Tools”
    Explains how to add custom buttons to tools in the Hub Console that
    allow users to invoke external services on demand.


Learning About Siperian Hub


Siperian Hub Documentation Navigator

The Siperian Hub Documentation Navigator directs you to the books in the Siperian Hub
documentation that are most useful to you based on your role.

Siperian Hub Installation Guide

The Siperian Hub Installation Guide for your platform explains how to install Siperian
Hub and Cleanse Match Server. There is a Siperian Hub Installation Guide for each
supported platform.

Siperian Hub Release Notes

The Siperian Hub Release Notes contain important information about this release of
Siperian Hub. Read the Siperian Hub Release Notes before installing Siperian Hub.

What’s New in Siperian Hub

What’s New in Siperian Hub provides an enhanced description of the new features for
this release.

Siperian Hub Tutorial

The Siperian Hub Tutorial walks you through various Siperian Hub implementation tasks
on a step-by-step basis.

Siperian Hub Administrator’s Guide

The Siperian Hub Administrator’s Guide explains how to configure, administer, and
manage a Siperian Hub implementation. It provides a description of the Siperian Hub
platform through a discussion of Siperian Hub concepts, services, tools, and databases.
Administrators should read the Siperian Hub Administrator’s Guide first.




Siperian Hub User’s Guide

The Siperian Hub User’s Guide explains how to use Siperian Hub. It provides a
description of the Siperian Hub platform through a discussion of Siperian Hub
concepts and tasks. Data stewards and users who are new to Siperian Hub should read
the Siperian Hub User’s Guide first.

Siperian Hub Implementer’s Guide

The Siperian Hub Implementer’s Guide explains how to design, implement, test, and deploy
a Siperian Hub implementation. Implementers must be familiar with the content of the
Siperian Hub Administrator’s Guide as well as the Siperian Hub Implementer’s Guide before
starting a Siperian Hub implementation.

Siperian Services Integration Framework Guide

The Siperian Services Integration Framework Guide explains how to use the Siperian Hub
Services Integration Framework (SIF) to integrate Siperian Hub functionality with your
applications and how to create applications using the data provided by Siperian Hub.
SIF allows you to integrate Siperian Hub smoothly with your organization's
applications.

Siperian Training and Materials

Siperian provides live, instructor-based training to help you become a proficient user as
quickly as possible. From initial installation onward, a dedicated team of qualified
trainers ensure that your staff is equipped to take advantage of this powerful platform.
To inquire about training classes or to find out where and when the next training
session is offered, please visit our web site or contact Siperian directly.


Contacting Siperian
Technical support is available to answer your questions and to help you with any
problems encountered using Siperian products. Please contact your local Siperian
representative or distributor as specified in your support agreement. If you have a
current Siperian Support Agreement, you can contact Siperian Technical Support:

Method              Contact Information
World Wide Web      http://www.siperian.com
E-Mail              support@siperian.com
Voice               U.S.: 1-866-SIPERIAN (747-3742)

We are interested in hearing your comments about this book. Send your comments:
• by E-Mail: docs@siperian.com
• by Postal Service:
  Documentation Manager
  Siperian, Inc.
  1820 Gateway Dr., Suite 109
  San Mateo, CA 94404



Chapter 1: Introducing Siperian Hub Implementation

This chapter introduces the overall Siperian Hub implementation process and describes
key concepts you need to understand before starting a Siperian Hub implementation
project. It provides a framework and methodology for implementing Siperian Hub in a
Siperian customer environment. This framework is intended to help with
implementation planning in conjunction with the particular requirements of your
Siperian Hub implementation. Although every Siperian Hub implementation is unique
in specific ways, certain principles, patterns, and best practices can apply generally
across most Siperian Hub implementations.

Before you attempt to implement your Siperian Hub system, you should be intimately
familiar with Siperian Hub and proficient in using the Siperian Hub tools. To learn
more about using Siperian Hub, read through the following documents:
• Siperian Hub User’s Guide
• Siperian Hub Administrator’s Guide

Chapter Contents
• Siperian Implementation Methodology
• Roles in a Siperian Hub Implementation Project
• Phases in a Siperian Hub Implementation Project


Siperian Implementation Methodology


The Siperian implementation methodology provides a comprehensive set of
procedures, guidelines, best practices, templates, and checklists for implementing the
Siperian Hub in a customer environment. It is intended to provide project teams with
the flexibility to tailor an implementation project to meet their specific needs, while still
providing the structure and guidance required to successfully implement Siperian Hub.

Reducing Project Risk


The main focus of the Siperian implementation methodology is to reduce project risk
by:
• Standardizing the approach to implementing Siperian solutions through the use of
best practices and templates
• Applying a risk avoidance-based scheduling approach to all project plans so that
high-risk components of the project plan are completed as early as possible
• Including checkpoint review processes to help keep projects on track
• Providing sufficient knowledge transfer of Siperian products and implementation
methodology, along with associated skills, to customers and implementation
partners

Core Principles
The Siperian implementation methodology is deliverables-based, not time-based.
Deliverables are produced by specific activities that are grouped into five gated phases
(described in “Phases in a Siperian Hub Implementation Project” on page 6). Gated
phases mean that the project needs to pass through a checkpoint gate (a specific review
process) before any activities for the next phase can begin.

The objective of checkpoint gate reviews is not to enforce a rigid waterfall
methodology in which everything must be completed, approved, and signed off before
any activities in the next phase can begin. Used on its own, the Siperian
implementation methodology allows for overlap between phases, with as much
concurrency as possible, without exposing the project to unacceptable risk.
The checkpoint gate reviews determine whether a sufficient portion of the deliverables
from the current phase have been delivered with acceptable quality before the phase
can be considered complete.

The Siperian implementation methodology can be used on its own or it can be
incorporated into many other methodologies, such as PMBOK, Prince2, Iterative,
Waterfall, RAD, and others (including your own in-house methodology). If you do
incorporate the Siperian implementation methodology into your enterprise project
management methodology, then your approach to starting a new phase will be
determined by the guidelines of your particular enterprise project management
methodology.

The Siperian implementation methodology is a project-based methodology that is
based on the following principles:
• A project is a temporary and unique endeavor.
• A project has a start date and an end date.
• A project has a specific scope that is constrained by time, cost, and quality.
• A project contains risk that must be managed.

The final goal of any project implemented under the guidelines of the Siperian
implementation methodology is to deliver a fully configured, tested, and deployed
Siperian Hub environment with the appropriate levels of project documentation.

Roles in a Siperian Hub Implementation Project


A Siperian Hub implementation project usually involves the following roles, any of
which might be filled by the customer, Siperian, or a third-party integrator.
Typical Roles in a Siperian Hub Implementation Project

Customer Project Manager
Manages the overall project, including:
• Provides day-to-day project management, planning, and tracking
• Ensures that all issues and change requests are communicated and resolved in a timely manner
• Defines and communicates resource needs
• Provides best practices and program management guidance
• Assists in requirements definition

Technical Lead
• Primary technical representative on the project team
• Participates in analysis, design, and testing activities
• Manages Master Data design and implementation, including data modeling, business rules, data loads, rules tuning, consolidation QA, and package/view configuration

Database Administrator
• Configures the database for Siperian Hub
• Sets up the Hub databases
• Works with the Solution Architect during Hub database performance testing and tuning

System Administrator
• Configures the required hardware and infrastructure software

Solution Architect
• Provides expert advice, counsel, and technical expertise to the project team to help assure that Siperian solutions are designed and developed in the optimal manner and in accordance with industry and Siperian best practices

Hub Builder
• Assists with Siperian Hub design, development, testing, and deployment

EAI Specialist
• Provides the design and development of EAI programs

ETL Specialist
• Provides the design and development of ETL programs/modules

Web Services Specialist
• Provides the design and development of Web interface applications

Checkpoint Reviewer
• Provides an independent review of designs and deliverables at key junctures in the project to help assure the quality of the end product

The distinctions here are fluid and project-dependent. For a given Siperian Hub
implementation project, a single team member might be responsible for multiple roles,
and a single role might be shared among multiple team members.


Phases in a Siperian Hub Implementation Project


A Siperian Hub implementation can be broken down into five distinct phases:
• Discover Phase
• Analyze Phase
• Design Phase
• Build Phase
• Deploy Phase

Each phase has specific activities and deliverables.

Note: A sixth phase, the management of steady-state processes for supporting the
environment post-deployment, is outside the scope of this document.

Discover Phase
The Discover phase initiates the implementation project and includes the following
activities:
• Identifying the overall vision driving the need for the project
• Analyzing the high-level requirements for the project

• Defining scope restrictions for the project
• Defining the high-level solution architecture
• Project planning and costing, along with all underlying assumptions
• Assessing project risk and defining risk mitigation strategies
• Defining service level agreements (SLAs) for key systemic qualities, such as
scalability, high availability, and performance

Note: Describing the Discover phase is outside the scope of this document.

Analyze Phase
The Analyze phase involves refining the analysis of the system requirements, including:
• Detailed source data analysis
• Detailed requirements definition
• Detailed gap analysis
• Evaluation and acquisition of any third party solutions
• Refining the solution architecture

Design Phase
The Design phase focuses on translating the requirements of the Analyze phase into
concrete designs that can be implemented and tested in the Build phase. It includes:
• Data modeling
• Interface design
• Definition of business rules for cleansing, matching, merging, and maintaining data
• Codification of standards and conventions
• Definition of test cases


Build Phase
The Build phase focuses on the following activities in a development environment:
• Siperian Hub installation and setup
• Configuring Siperian Hub to implement the data model and rules defined in the
design phase
• Fine-tuning the rules
• Developing any interfaces between Siperian Hub and the source and target systems
• Security and rules configuration
• Testing the interfaces and rules

Deploy Phase
The Deploy phase involves:
• Deploying the fully built, tested, and accepted solution into a production
environment
• Wrapping up the project
• Handing the system over to the appropriate system support team
• Training

Chapter 2: Analyzing Data

This chapter describes activities involved with analyzing data for a Siperian Hub
implementation project.

Chapter Contents
• Getting Started
• Defining the Flow of Data Between Siperian Hub and Source/Target Systems
• Consider Data Cleansing and Standardization Rules
• Consider Trust Levels and Validation Rules
• Consider Match Rules

Getting Started
A critical early step in a Siperian Hub implementation project is to gain a thorough
understanding of the data that you are integrating. For example, for each data source,
you must know the data’s relative accuracy, structure, size, trends, expected growth,
and any other characteristics that are peculiar to the data.

Data analysis is performed in the Analyze phase. The Analyze phase follows the
Discover phase, during which a high-level data analysis is performed in order to
identify any data issues or gaps that could impact project scope, timeline, costs, or risks.
The Analyze phase includes both data analysis and business and functional
requirements analysis. Data analysis and requirements analysis tend to happen in
parallel with each other. The findings from data analysis often impact the requirements
specification, and vice versa. However, data analysis is not dependent on requirements
analysis.

Defining the Flow of Data Between Siperian Hub and Source/Target Systems

Data analysis begins by determining the source systems that will feed data into Master
Reference Manager (MRM).
You must know exactly what data is coming—and where it is coming from—by
understanding what sources feed data into Siperian Hub, as well as what target systems
are fed updates from Siperian Hub. At a high level (in the Discover phase), the data
flow is just a system-level bubble diagram. By the time the technical design document
is completed in the Design phase, it has evolved to the level of specific files or tables.

Determine Data Source Characteristics


For each data source, consider the following tasks:
• Determine the size, data type, data age, quality, quantity, source, and any other
characteristics that are peculiar to the data set.
• Determine any data quality issues.
• Check the primary keys that are available in the data.
• Gain an understanding of the data cardinality—between entities, as well as
consolidation cardinality.
• Determine total data volumes, expected delta volumes, and load frequencies per
source.
• Identify any special initial data load requirements for the system.
• Analyze data for invalid conditions, and then perform frequency analysis to
determine how often those conditions occur per source.
• Differentiate between invalid data conditions that can or cannot be remedied
through data cleansing. The latter data conditions are the ones that should be
considered in defining trust and validation rules.
• Identify the more correct data, not just the more correctly formatted data.


• Consider which external systems, including source systems, should be updated
when data changes in a base object. For example, you might want to update the
CRM system whenever a customer’s address gets changed. Message queue triggers
can be configured in the Hub Console so that data changes can be published to
outbound message queues for retrieval by external systems. To learn more, see the
Siperian Hub Administrator’s Guide.

Assemble a Statistically Representative Sample Data Set


To assist in data analysis, assemble a complete, diverse, but statistically representative
sample of your production data from each source system. This sample should contain
various types of non-identical duplicates. The more closely the sample data reflects the
typical characteristics of the production data set, the more useful it will be. Having a
sample data set is an invaluable resource for designing, configuring, and testing match
rules.

Consider Data Sizing


Developing detailed knowledge about data sources provides the basis for correctly
sizing your MRM implementation. Consider the following factors:
• data volume—number of rows, size of rows, large data sets, amount of raw data,
ratio of raw to consolidated records, how “matchy” the data is
• data volatility—the frequency of updates to the data within the source system
• load frequency—how often this data will be brought into MRM to update the
master records
• data model—number of base objects
• history retention and audit requirements
• number of source systems
• match rules
• performance requirements, if applicable
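The factors above can be combined into a rough, back-of-envelope sizing estimate. The following Python sketch is illustrative only: the function names and all figures are hypothetical assumptions, not Siperian sizing guidance.

```python
def estimate_consolidated_rows(raw_rows, avg_cluster_size):
    """Estimate consolidated row count from raw volume and how "matchy"
    the data is (average number of raw records merging into one)."""
    return int(raw_rows / avg_cluster_size)

def project_raw_rows(initial_rows, delta_rows_per_load, loads):
    """Project the raw row count after a number of delta loads."""
    return initial_rows + delta_rows_per_load * loads

# Hypothetical source: 10M raw rows, ~2.5 raw records per consolidated
# record, 200K rows per monthly delta load, projected over one year.
consolidated = estimate_consolidated_rows(10_000_000, 2.5)   # 4,000,000
year_end_raw = project_raw_rows(10_000_000, 200_000, 12)     # 12,400,000
```

Even a crude estimate like this helps size base object tables and plan load windows before detailed volumetrics are available.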


Consider the Relationship Between Data and Business Processes
It is essential to understand:
• the importance of each column’s data to the business processes and business users
that produce it
• the quality of the data capture processes and data validation processes in each
source system
• how closely your use of the data aligns with the purposes of the people who
originate it (closer alignment means greater reliability)

Consider Data Cleansing and Standardization Rules


When analyzing data, identify source attributes that would benefit from data cleansing
and standardization rules. Cleanse lists facilitate data conversion during the staging
process, ensuring that the data that ends up in the staging table is in a standardized,
consistent format. For each source, the appropriate transformation from
source-specific codes to the standard codes can be achieved with a cleanse list
maintained in MRM. This also enables the base objects to contain the actual
standardized code values (as opposed to the Rowid_Object pointing to the standard
code value).

If cleanse lists are used to standardize codes, then a lookup table can be set up in MRM
for each code to validate the code during data loading, ensuring that any record
containing an erroneous code for which there is not a cleanse list entry does not get
propagated into the base objects.
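As a concrete illustration of the cleanse list and lookup pattern described above, the following Python sketch standardizes source-specific codes during staging and validates them against a lookup table during load. The code values and column are hypothetical examples, not actual MRM configuration.

```python
# Hypothetical cleanse list: source-specific gender codes -> standard codes.
GENDER_CLEANSE_LIST = {
    "M": "MALE", "MALE": "MALE", "1": "MALE",
    "F": "FEMALE", "FEMALE": "FEMALE", "2": "FEMALE",
}

# Hypothetical lookup table of valid standard codes.
GENDER_LOOKUP = {"MALE", "FEMALE", "UNKNOWN"}

def cleanse_code(raw_value, cleanse_list, default="UNKNOWN"):
    """Standardize a raw source code during staging; fall back to a
    default for values with no cleanse list entry."""
    if raw_value is None:
        return default
    return cleanse_list.get(raw_value.strip().upper(), default)

def validate_code(code, lookup):
    """Reject records whose code is not in the lookup table so they do
    not propagate into the base object."""
    return code in lookup

standardized = cleanse_code(" m ", GENDER_CLEANSE_LIST)   # "MALE"
assert validate_code(standardized, GENDER_LOOKUP)
```

In MRM itself this behavior is configured declaratively rather than coded, but the transformation and validation steps follow the same shape.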

Consider Trust Levels and Validation Rules


During the analysis and design phases of a project, it is important to identify the
factors affecting the trust levels of your source data, and to determine what validation
rules need to be implemented. Although configuring trust levels occurs later in the
Siperian Hub implementation process, you should begin thinking about trust level
settings and validation rules during data analysis. As you analyze the data, you learn
more about its varying levels of accuracy. This knowledge contributes to the trust rules
design.


The relative importance of each source system and the relative quality of the data
coming from it are the main factors in determining trust settings. If you find during
data analysis that some data is typically erroneous, then you probably want to give it a
lower trust score.

To learn more about defining trust settings and validation rules, see Chapter 4, “Using
Trust Settings and Validation Rules”. For more information on using the MRM tools to
set trust levels, see the Siperian Hub Administrator’s Guide.

Trust Levels
In MRM, the Siperian Trust Framework ensures that its consolidated records, at the
cell level, contain the most reliable information available from the data sources. Trust is
a mechanism for measuring the confidence factor associated with each cell based on its
source system, change history, and other business rules. Trust takes into account the
validity of the data, the age of the data, and how much its reliability has decayed over
time. For more information about trust settings, see “Using Trust Levels” on page 52.

Trust is assigned at the column level. It can be specified, for example, that Source
System 1 is more reliable for “customer name” but Source System 2 is more reliable for
“phone number”. There are several parameters that can be set to assign Trust for each
source system’s column, such as:
• Maximum (initial) Trust level for a new data value
• Minimum Trust level for an “old” data value
• Decay Period or length of time that the trust level takes to decay from the
Maximum Trust to the Minimum Trust
• Decay Type or the shape of the decay curve (a straight line or a curve)

For example, the “Email Address” from a Web application might be assigned
Maximum Trust of 80, Minimum Trust of 20, Decay Period of 1 year, and Decay Type
of SIRL (Slow Initial, Rapid Later), indicating a curve that decays gently at first and
more rapidly later.
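The decay parameters above can be pictured with a small calculation. The following Python sketch is a hedged illustration: the quadratic curve used for SIRL below is an assumed shape chosen to decay gently at first and rapidly later; the actual curve shapes used by the Siperian Trust Framework are configured in the product.

```python
def decayed_trust(max_trust, min_trust, age, decay_period, decay_type="linear"):
    """Return the trust score for a value of the given age.

    age and decay_period are in the same units (e.g. days).
    decay_type "SIRL" uses an assumed quadratic curve: slow initial
    decay, rapid later decay."""
    if age >= decay_period:
        return min_trust
    fraction = age / decay_period
    if decay_type == "SIRL":
        fraction = fraction ** 2      # gentle at first, steeper near the end
    return max_trust - (max_trust - min_trust) * fraction

# Hypothetical "Email Address" column: Maximum Trust 80, Minimum Trust 20,
# Decay Period of one year, SIRL decay.
print(decayed_trust(80, 20, 0, 365, "SIRL"))     # 80.0 at load time
print(decayed_trust(80, 20, 365, 365, "SIRL"))   # 20 after full decay
```

At the half-way point the SIRL curve still returns 65.0 where a linear decay would already be at 50.0, which is the "slow initial" behavior the parameter name describes.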

In addition to internal data sources, consider data sources that are not controlled
within your organization. For example, suppose your organization purchases data sets


from a third-party provider. These data sources might be guaranteed to consist of
unique records with a high level of accuracy. Accordingly, you could decide to
designate a high level of trust for this data.

Validation Rules
A validation rule tells Siperian Hub the condition under which a data value is not valid.
If data meets the criterion specified by the validation rule, then the trust value for that
data is downgraded by the percentage specified in the validation rule. To learn more
about validation rules, see “Using Validation” on page 65.

Here are some examples of validation rules:
• Downgrade trust on Last Name if length(last_name) < 3 and last_name <> 'NG'
• Downgrade trust on middle_name if middle_name is null
• Downgrade trust on Address Line 1, City, State, Zip, and Valid_address_ind if
Valid_address_ind = 'False'

If the Reserve Minimum Trust flag is enabled (checked) for a column, then the trust
cannot be downgraded below the column’s minimum trust setting.
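To make the mechanics concrete, the following Python sketch applies a validation rule and downgrades trust by a percentage, respecting a Reserve Minimum Trust floor. The rule condition and percentages are hypothetical examples, not actual Siperian defaults.

```python
def apply_validation(trust, downgrade_pct, min_trust=0, reserve_minimum=True):
    """Downgrade a trust score by a percentage when a validation rule's
    condition fires, optionally floored at the column's minimum trust."""
    downgraded = trust * (100 - downgrade_pct) / 100
    if reserve_minimum:
        return max(downgraded, min_trust)
    return downgraded

def last_name_rule(last_name):
    """Example condition: downgrade if the last name is suspiciously short
    (with an exception for the legitimate surname 'NG')."""
    return last_name is not None and len(last_name) < 3 and last_name != "NG"

trust = 70
if last_name_rule("X"):
    trust = apply_validation(trust, 40, min_trust=30)   # 70 -> 42.0
```

A second rule firing with a 90% downgrade would take 42.0 below the floor, so with Reserve Minimum Trust enabled the score would stop at the minimum of 30.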

Consider Match Rules


Although configuring match rules occurs later in the Siperian Hub implementation
process, you should begin thinking about match rules during data analysis because the
data analysis will turn up data characteristics that govern the match rules. Therefore, as
you analyze data, do so with match rules in mind.

During data analysis, identify which columns are appropriate for matching.
For example, if a gender column is null 80% of the time, then this column is probably
not a column to use in a match rule. Similarly, investigate the distribution of data so
that you can assess in advance how selective a match rule needs to be for certain
columns.
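A quick profiling pass such as the following Python sketch can surface sparsely populated columns before match rules are designed. The sample rows and column names are hypothetical.

```python
def null_rates(rows, columns):
    """Return the fraction of rows in which each column is null or empty."""
    rates = {}
    for col in columns:
        empty = sum(1 for row in rows if not row.get(col))
        rates[col] = empty / len(rows)
    return rates

# Hypothetical sample of source records.
sample = [
    {"last_name": "Smith", "gender": None},
    {"last_name": "Jones", "gender": "F"},
    {"last_name": "Lee",   "gender": None},
    {"last_name": None,    "gender": None},
]
rates = null_rates(sample, ["last_name", "gender"])
# gender is null 75% of the time here, so it is a poor match column
```

The same pass can be extended to report distinct-value distributions, which indicate how selective a match rule on each column would be.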


To learn more about defining match rules, see Chapter 5, “Configuring and Tuning
Match Rules”. For more information on using the MRM tools to configure match
rules, see the Siperian Hub Administrator’s Guide.



Chapter 3: Designing the Data Model

This chapter describes what implementers need to know before building the data
model for a Siperian Hub implementation project. It is recommended for all
implementers and anyone else who must understand the Master Reference Manager
data model. To learn more about the data model, see the Siperian Hub Administrator’s
Guide.

Note: This chapter assumes that the reader is familiar with conventional data modeling
methodologies—it supplements conventional data modeling techniques with
MRM-specific recommendations.

Chapter Contents
• About Data Modeling for MRM
• Design Principles
• Design Patterns


About Data Modeling for MRM


Data modelers and design consultants responsible for defining the data model for
MRM require expertise in relational modeling at the conceptual, logical, and physical
levels. The following sections introduce the various types of models necessary to
develop a Siperian Hub implementation:
• Data Model Design Deliverables
• Conceptual Model
• Logical Model
• Physical Model

Data Model Design Deliverables


The process of designing the data model for consolidated reference data for a Siperian
Hub implementation involves a series of deliverables. The following figure shows the
major phases of the Siperian implementation methodology, along with the data model
delivered in each phase.

• The design starts with a conceptual model, which identifies the main objects to be
managed in MRM. It also identifies which objects will be consolidated, because
match criteria ultimately drive modeling decisions for the physical model.
• The conceptual model is used as the starting point for the logical model, which
provides a logical representation of the entities and attributes to be managed in
MRM.
• The logical model is transformed into a physical model, which is the model that is
then defined in MRM using the Schema Manager in the Hub Console.
Transitioning from a logical model to an ideal MRM physical model involves
design principles that are described in “Design Principles” on page 27 later in this
chapter. The physical model is the final output from the data modeling design
steps, and it is the model that the business and system owners need to approve.

The following figure shows the increasing level of detail and number of entities in
conceptual, logical, and physical models.

Conceptual Model
The purpose of the conceptual model is to identify and describe the main objects
needed to create a global business view of the data, with little detail. This step is often
skipped in typical IT projects, or it might be combined with the logical model.
However, for Siperian Hub implementations, it is very important to go through this
step because it starts the process of thinking about match requirements, which impact
the physical model design.

The conceptual model for a Siperian Hub implementation shows the business entities
that will need to be managed in MRM, along with the relationships among the business
entities and some high-level design properties. If you have worked with entity
relationship diagrams (ERDs), the conceptual model might look similar. To facilitate
logical and physical (or logical to physical) data model design, the Match and Merge
and Intertable Match Parent properties are the most critical properties to identify (to
learn more, see the Siperian Hub Administrator’s Guide). One approach is to begin with
the worst case match scenario, determine the elements in the token match table, and
then trim this down to the tables that would be realistically used for matching.


The following figure shows an example of a conceptual data model.

The conceptual model must be derived from the system requirements, with inputs
from analyses of internal and external business system data sources.

Note: For some projects, a pre-existing logical data model might be available. In such
cases, it is still important to create a conceptual data model to ensure that you have
identified the Match and Merge requirements that can have a significant impact on the
subsequent physical data model.

Logical Model
The purpose of building a logical model is to confirm that the application will satisfy
the business requirements. A logical model represents the entities, relationships, and
attributes that are representative of the business information needs. A logical model is
usually a normalized model. Normalization is the process of determining stable
attribute groupings in entities with high interdependency and affinity.

By defining entities, attributes, and their relationships, you might discover data model
design flaws that could produce anomalies. Data flaws include:
• Missing entities
• Multiple entities that represent the same conceptual entity


• Many-to-many relationships that need an additional intersection entity, turning
each many-to-many relationship into two one-to-many relationships
• Multivalued and redundant attributes

Example Logical Model with Design Flaws

The following figure shows an example of a logical model that has some design flaws.

This logical model is based on the previous conceptual model example shown in the
figure in “Conceptual Model” on page 19. It has the following design flaws:
1. Affiliation Role probably needs a Lookup table to define the different types of
Affiliation Roles (missing entity).
2. Repeating attributes (phone numbers, fax numbers, email addresses) can be
normalized into an Electronic Address entity.

Designing the Data Model 21


About Data Modeling for MRM

Example Logical Model with Fixed Design Flaws

The following figure shows the logical model after it has been fully normalized and
missing entities have been added.

The logical model includes the following new entities:


3. An Electronic Address entity has been added to handle the repeating phone and
fax number attributes (which have therefore been removed from the Customer
Address intersection table).
4. An Electronic Address Type table has been added to provide definitions for the
types of electronic address represented in each record.
5. An Affiliation Role lookup table has been added.


Pre-Existing and New Logical Models

Before considering how the logical model will transition to a physical model, it is
important to get the logical model right. In some Siperian Hub implementations, a
pre-defined logical model is available. In such situations, you still need to evaluate the
logical model to make sure that:
• it meets the stated business needs
• it makes sense logically
• the entities and attributes in the logical model can be populated from the source
systems (there is little point in modeling entities or attributes that cannot be
populated from the source systems)

The pre-existing logical model might not be tuned to work particularly well in MRM.
Therefore, you will need to determine how to transition that logical model to a suitable
physical model.

In other Siperian Hub implementations, you will need to define the logical model from
scratch. In such cases, the logical model can be defined in a way that suits the business
needs and is more closely aligned with the models for which MRM is tuned.

Objects in the Logical Model

When modeling for MRM, the logical model must focus on the actual entities that will
be defined in MRM as base objects or dependent objects.
Objects in the Logical Model

Base Objects
Used to describe central business entities, such as Customer, Product, or Employee.
In a base object, data from multiple sources can be consolidated or merged. Trust
settings are used to determine the most reliable value for each base object cell. In
addition, one-to-many relationships (foreign keys) can be defined between base
objects.

Dependent Objects
Used to store detailed information about the rows in a base object (such as
header-detail relationships). One row in a base object table can map to several rows in
a dependent object table.


You do not model history, cross-references, and so on, as MRM automatically creates
and manages these structures for you. In addition, avoid adding landing tables or
staging tables to the logical model, because they clutter the model unnecessarily.
You can model landing tables as part of the physical model.

Remember that the logical model is not an enterprise-wide data model. The logical
model is a model for reference data only, and it is usually only for a specific subset of
the reference data (such as Customer data). Similarly, do not include transaction data in
the logical model, and limit the model to the reference data that is to be managed in
MRM. Finally, bear in mind that the physical model—not the logical model—is the
actual model that you will implement for MRM.

Physical Model
The physical model is the actual model that you define using the Schema Manager in
the Hub Console (to learn more, see the Siperian Hub Administrator’s Guide). It is thus a
subset of the complete physical schema that will be generated by MRM. The physical
model diagram shows the base objects, dependent objects, and landing tables to be
implemented in MRM.

The rule of thumb for physical model diagrams is to show the user-defined entities and
attributes, plus the primary and foreign keys, so that relationships can be modeled
correctly. In the physical model, avoid showing MRM-generated entities or attributes
other than primary and foreign keys. All supporting tables—such as cross-references,
history tables, control tables, and staging tables—will be created by MRM and
therefore are not included in the physical model diagram.

MRM is flexible enough to implement any logical model as a physical model, but it is
tuned to work better with some types of models than with others. Performance is the
main driver for differences between the logical model and the physical model. Before
you develop a physical model for a Siperian Hub implementation, you must carefully
review your logical model in light of its performance implications. An ideal physical
model for MRM strikes a balance between a completely denormalized model (best
performance) and a highly normalized model (best flexibility).


The following figure shows an example physical model based on the logical model
described previously.

Notice that all of the entities defined in the logical model will be implemented as base
objects and that ROWID_OBJECT is used for all primary keys. In addition, notice that
the many-to-many relationship between the Customer and Address entities in the
logical model has been changed to a one-to-many relationship in the physical model.
The reasons for these changes will be explained in the “Design Principles” on page 27
section later in this chapter.

When designing the physical model, consider the following factors:


• Required Functionality
• Performance and Scalability
• Flexibility for Future Use
• Siperian Product Roadmap


Required Functionality

Required functionality is one of the key factors affecting design decisions in the
physical model. Some examples of functionality requirements include:
• If you must keep a history of changes to attributes of an object, then define that
object as a base object.

Performance and Scalability

A completely denormalized model gives the fastest performance, particularly for merge
and unmerge, as there are fewer child tables to be updated on merge or unmerge.
However, a completely denormalized model limits both flexibility and functionality.

The more denormalized the model, the fewer levels of consolidation are available, and
the more difficult it can be to add new data sources and new attributes or entities in the
future. You must therefore find a balance between modeling for performance
(denormalizing) and modeling for functionality/flexibility (normalizing). You should
not denormalize simply for the sake of denormalizing—there are some areas that are
better to denormalize than others, as they yield the most performance benefit with the
least functionality/flexibility loss. These issues are discussed in detail in the “Design
Principles” on page 27 section later in this chapter.

Flexibility for Future Use

When defining the physical model, it is important to keep possible future requirements
in mind, but without adding entities or attributes that cannot yet be maintained or that
are not yet fully understood. Sometimes building in system flexibility is as simple
as naming things flexibly. For example, if you are building a Customer master for
Organization data and you know that the plan is to add Person data to that Customer
Master within the next year, then consider using a name other than “Organization”
(such as “Business Party”) for the Customer table because the table may well end up
containing both Organization and Customer data.

Be wary of adding physical limitations that might later cause problems. One example of
this is specifying user-defined unique keys on base objects. If you define a unique key
on a base object, you cannot merge records in that base object. Although this might
not be a problem in the initial implementation of a project, it is not uncommon for

26 Siperian Hub XU Implementer’s Guide


Design Principles

new sources that are later added to a system will bring their own values for the base
object with the unique key, making it desirable to use match and merge functionality to
consolidate the new system’s data with that of the original system’s data.

Siperian Product Roadmap

An optimal physical design for a Siperian Hub implementation takes into account what
is known of future requirements, the Siperian product roadmap, and the intersection
between them. If you have any questions about how your model relates to the Siperian
product roadmap, arrange (through Siperian Support) for a data model review with
Siperian Solutions Delivery and Engineering.

If you model types of objects (such as households) or types of relationships that are
not discussed in this document, then you should review the data model with your
Siperian Solution Architect to make sure that the model does not run contrary to any
assumptions in MRM design, QA, or planned features. This review should be
conducted as part of the data model checkpoint review that should already be built into
your project plan.

Design Principles
This section describes some underlying design principles for transitioning from a highly
normalized logical model to a physical model.
• Principle 1: Consider Deep Versus Wide
• Principle 2: Match Requirements Drive the Model
• Principle 3: Consolidation Counts
• Principle 4: Pass the Independence Test
• Principle 5: Mix Different Types of Customers Carefully
• Principle 6: Landing and Staging Data


Principle 1: Consider Deep Versus Wide


This design principle refers to the number of direct child tables linked to a parent table.
The following figure shows the two different types of designs.

This principle applies when you want to merge or unmerge on the parent table.
The design principle mainly affects performance of the merge and unmerge processes.

The more directly-linked child tables a parent table has, the more of those tables must
have foreign key references updated when records merge in the parent table. Therefore,
the more child tables a parent table has, the slower merges for the parent table will be.

This principle applies to the unmerge process as well. For unmerges in a deep model,
consider how far you allow unmerges to cascade. Which child tables need to have
cascade unmerge enabled? How many child tables deep should you choose to enable
the Unmerge on Parent Unmerge flag? The more child tables you have with merged
records and the Unmerge on Parent Unmerge flag enabled, the more work the
unmerge needs to do, and therefore the slower the unmerge process.
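The effect can be sketched as follows; this is a simplified illustration of why merge cost grows with the number of directly linked child tables, not a description of MRM internals, and the table and column names are hypothetical.

```python
# Illustrative sketch: when two parent records merge, every directly linked
# child table must have its foreign keys re-pointed to the surviving parent
# row, so merge work grows with the number of child tables.

def merge_parent(surviving_id, merged_id, child_tables):
    """Re-point child foreign keys from the merged row to the survivor."""
    updates = 0
    for table in child_tables:          # one pass per directly linked child
        for row in table:
            if row["parent_fk"] == merged_id:
                row["parent_fk"] = surviving_id
                updates += 1
    return updates

# A "wide" parent with four child tables does four times the FK scanning of
# a parent with one child table, for the same number of child rows each.
children = [
    [{"parent_fk": 7}, {"parent_fk": 2}],   # e.g. an Address child table
    [{"parent_fk": 7}],                     # e.g. a Phone child table
]
print(merge_parent(surviving_id=1, merged_id=7, child_tables=children))  # → 2
```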


Principle 2: Match Requirements Drive the Model


Match criteria also drive physical data model decisions with respect to functionality.
Intertable match criteria involve the use of attributes from one table in the match rules
of a related table—for example, matching customers using address information from
the Address table. For more information, see “Address Example” on page 31.

Another area in which required match functionality can affect the physical model is the
way in which match rules must be defined. If you need to define an AND match rule,
you need to denormalize repeating attributes that are to be used in the match rule.
Normalizing repeated attributes into a child table allows OR match rules on the
normalized attributes, not AND match rules.

For example, if you create an Electronic Address table that contains phone numbers
and e-mail addresses, you can use these in a match rule that identifies records as
matching if their phone numbers are the same OR if their email addresses are the same.
If you need a match rule that identifies records as matching if their phone numbers
match AND their email addresses match, then you need to denormalize these into
separate columns.
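The difference can be sketched as follows; this is illustrative Python, not Siperian match-rule syntax, and the sample values are made up.

```python
# With phone and e-mail denormalized onto one record, a single rule can
# require phone AND e-mail to match; a normalized child table naturally
# supports only OR semantics across its rows.

a = {"phone": "555-123-4576", "email": "n.e.one@example.com"}
b = {"phone": "555-123-4576", "email": "ann.other@example.com"}

def match_or(x, y):
    return x["phone"] == y["phone"] or x["email"] == y["email"]

def match_and(x, y):
    return x["phone"] == y["phone"] and x["email"] == y["email"]

print(match_or(a, b))   # → True  (matching phone numbers are enough)
print(match_and(a, b))  # → False (the e-mail addresses differ)
```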

The following figure shows an example of a normalized Electronic Address table that
supports OR match conditions only.

This Electronic Address table supports match rules in which phone numbers matched
OR e-mail addresses matched. In the example shown, Customer IDs 12345, 45678,
and 00001 would all be identified as matches for one another because of their
matching phone numbers.


The following figure shows an example of denormalized attributes to support AND
match conditions.

Logically, this table shows the same data as in the normalized Electronic Address table,
but the physical structure has been denormalized to support match conditions that
specify AND criteria. In this example, Customer IDs 12345 and 45678 would match
because their phone numbers match AND their email addresses match. Customer ID
00001 would not be considered a match for the other two records because it has a
different e-mail address. For more information, see “Communication Channel Models”
on page 46.

Principle 3: Consolidation Counts


The physical model must take into account the required results after consolidation and,
particularly, the desired cardinality of base object to cross-references after consolidation
(where cardinality is the ratio of the number of records in the base object to the
number of records in the cross-reference). The physical model must also consider the
effects of source updates on the surviving record. This section describes several
examples to illustrate this principle.

Physician Specialties Example

A physician can have one or more specialties. Pharmaceutical companies are often
interested in identifying only the primary specialty for a physician. However, when two
physician records are merged from different sources, those sources might provide
different values for the physician's primary specialty. If the required cardinality after
merging the specialties is one surviving primary specialty, then you should include
Primary Specialty as a column on the Physician base object. However, if the
pharmaceutical company wants to keep all of the specialties for the merged physician
record, then Physician Specialty must be a child table of the Physician table.


Address Example

Logically, a single address can belong to multiple customers. For example, office
addresses can be shared by colleagues at the same location, or group practice addresses
can be shared by partners in the same law firm. Of course, a customer can also have
multiple addresses. For this reason, logical models usually have customer and address as
distinct entities with a many-to-many intersection table between them.

However, in a physical model for consolidated data, this approach is not necessarily
practical, especially if you are trying to reduce duplication in addresses from multiple
sources. Consolidating addresses when they are not directly linked to customers means
that you are consolidating addresses across customers. For example, in the following
figure, N.E. One and Ann Other both have the same address. If the two address
records are merged, then one survived address record will remain and that record will
be linked to both N.E. One and Ann Other through the Customer Address
intersection table.

Avoid consolidating addresses across customers unless there is a real business need for
an enterprise-wide unique ID per physical address location. Even if there is a real
business need, there are other ways to model this instead. For more information, see
“Design Patterns” on page 42.

Consolidating addresses across customers involves limiting address changes to the right
customers, performance considerations, and functionality considerations.


Limiting Address Changes to the Right Customers

If one Customer changes their address, then you need to make sure that the address
change is not automatically applied to the consolidated address record for all
customers. For example, in the figure shown in “Address Example” on page 31, if N.E.
One moves their office, it does not mean that Ann Other has also moved their office,
so the consolidated address that was previously linked to both N.E. One and Ann
Other now belongs only to Ann Other.

Performance Considerations

Consolidating addresses across customers means that you usually have a high degree of
cardinality between the source addresses and the resultant consolidated addresses.
The higher the number of duplicate records, the more work the merge must do to
process them. The cardinality is reduced if Customer ID is one of the match criteria for
addresses—that is, if addresses are consolidated only within customer records, not
across them. The following figure shows the recommended approach for customer
address relationships.

Using this approach also reduces the number of tables that must be staged and loaded.
This approach does not necessarily yield a large performance gain if your
implementation involves only a handful of source systems to process. However, the
more source systems you configure, the greater the performance impact that each
additional target table has on stage and load batches. For example, a Siperian Hub
implementation with five sources for the previous model (shown without consolidated
addresses in "Business Party and Differentiated Customer Models" on page 36)
requires 15 stage jobs and 15 load jobs, and an implementation with ten sources for
that same model requires 30 stage and 30 load jobs. For the model with consolidated
addresses, five sources require ten stage and ten load jobs, and ten sources require
20 stage and 20 load jobs.
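The job counts above follow from simple arithmetic: one stage job and one load job per source system per target table. A sketch (the previous model is assumed here to have three target tables and the consolidated model two, consistent with the counts in the text):

```python
# Each (source system, target table) pair needs its own stage job and load
# job, so consolidating addresses (fewer target tables) shrinks both counts.

def jobs(num_sources, num_target_tables):
    return num_sources * num_target_tables  # stage jobs; load jobs are the same

print(jobs(5, 3))   # → 15 (plus 15 load jobs)
print(jobs(10, 3))  # → 30
print(jobs(5, 2))   # → 10 with consolidated addresses
print(jobs(10, 2))  # → 20
```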

Functionality Considerations

Modeling customer address as a direct (one-to-many) relationship between customer
and address means that customer address attributes can be stored directly on the
Customer Address base object or as a child base object linked to Customer Address.
As long as the attributes are part of a base object, MRM tracks their history.
This approach also means that Customer can use attributes from child tables of the
Customer Address table for matching.

Similarly, keeping customer address attributes in base objects means that duplicate or
overlapping attribute values from multiple sources can be consolidated to get to “best
of breed” values for those attributes.

Principle 4: Pass the Independence Test


Independent base objects are base objects that are not linked to the core consolidated
object through a one-to-many or a many-to-one relationship, but are instead linked
through many-to-many intersection tables. If a base object is modeled as an
independent base object, then its records should make sense on their own, without any
reference to the core base object. It should make sense to consolidate its records to a
distinct set of values.

Steps for Testing Independence

The independence test for a physical model includes the following steps:
1. Identify the core base object that is being consolidated in the Hub
Store—Customer in a Customer Master, Supplier in a Supplier Master, and so on.
2. Look for any many-to-many relationships (direct or indirect).
3. Inspect the base object that is on the other side of the many-to-many relationship
and ask the question: “What can the business do with a distinct list of the things in
this object without knowing who the Customer is?” If the answer is “Nothing,”
then change the many-to-many to a one-to-many relationship.


Example Using a Highly Normalized Model

The following figure shows an example of a highly normalized model.

In this model, Specialty, Address, and Electronic Address are all linked to the core
object—Customer—through many-to-many relationships. You can therefore apply the
independence test by asking the following questions:

Question: What can the business do with a distinct list of Specialties without
knowing who the Customer is?
Answer: The distinct list of Specialties can be used to provide a pick or lookup
list of Specialty values in a capture screen for new Customer information. The
business wants to standardize the list of Specialties it uses in reporting by
assigning each source specialty to a consolidated enterprise specialty value.

Question: What can the business do with a distinct list of Addresses without
knowing who the Customer is?
Answer: In most cases, the answer to this question is "Nothing." Addresses are
usually meaningful only in terms of the Customer to whom the Address belongs.

Question: What can the business do with a distinct list of Electronic Addresses
(for example, telephone numbers) without knowing who the Customer is?
Answer: "Nothing"—a telephone number has no significance in its own right.


Converting relationships from many-to-many to one-to-many for the objects that failed
the independence test would result in the model shown in the following figure.


Principle 5: Mix Different Types of Customers Carefully


In Siperian Hub implementations, you must be careful when mixing different types of
customers.

Business Party and Differentiated Customer Models

This principle focuses on the consequences of implementing two different models—a
Business Party model versus a Differentiated Customer model—which are shown in the
following figure.

Business Party Model: All Customer records are loaded into the same Business Party
table, and an attribute on that table identifies the type or classification of the
Customer records. In this example of a Business Party model, the Class of Customer
attribute distinguishes Organizations from Individuals.

Differentiated Customer Model: The type or classification of the Customer records is
implied by where the records are stored. In this example of a Differentiated
Customer model, the Organization table holds Customers classified as Organizations,
and the Individual table holds Customers classified as Individuals.

Data modelers often prefer the Differentiated Customer model because it reduces null
attributes on the Customer table (for example, the Organization Customer does not
need to carry any attributes that apply only to an Individual Customer). However, there
are definite advantages to using a Business Party model over a Differentiated Customer
model, even if it does result in more null attributes on the Business Party table. Such
advantages include:
• The Business Party model easily supports any number of chained relationships
between different classes of customers and/or the same classes of customers.
• The Business Party model allows you to model networks, not just parent/child
hierarchies.
• The Business Party model provides a single unique identifier for each Customer
without any chance of overlap.
• The Business Party model allows you to search for Customers in one place without
needing to know anything about the type of Customer.
• The Business Party model allows you to identify source records that have given
Customers incorrect types.

Mixing Models

In your Siperian Hub implementation, you might decide to implement a Business Party
model so that you get one unique Customer identifier and you can model Customer
Affiliations flexibly. If you want to avoid too many redundant/null value columns on
the Business Party base object, you can use child tables to carry some of the attributes
that are specific to particular sub-types of Customers. However, if you do this, you must
be very careful about how you mix the Business Party and Differentiated Customer
models.


The following figure shows a poor mix of these models.


The following figure shows a better way to mix these models.

If the merge performance is a concern, then consider using a pure Business Party
model, as shown in the figure in “Business Party and Differentiated Customer Models”
on page 36.

This is a better mix than the figure showing a poor mix of models because it simplifies
the relationships between the objects and reduces the number of cross-table joins
required to get the match data. The preferred model is still the full business party
model shown in “Business Party and Differentiated Customer Models” on page 36, as
that reduces the number of child tables to be maintained on merge and unmerge.

The Customer match attributes have been denormalized so that they are attributes of
the Business Party base object instead of the Organization and Individual base objects.
This reduces the number of cross-joins used in populating the match token.

In the better mix, all relationships have been defined at the Business Party level,
making it easier to navigate and maintain the relationships. The poor mix has an uneasy
mixture of relationships, with Addresses having nullable foreign keys to either
Individual or Organization.


Principle 6: Landing and Staging Data


This principle considers how you design landing and staging tables in your Siperian
Hub implementation.

Landing Table

Although we have no strong design recommendations with respect to landing tables,
consider the following issues for your Siperian Hub implementation:
• Some implementations have used source-specific landing tables (a landing table per
source table/source file). This keeps the landing table format closer to the source
format and means that the ETL process does not need to transform all sources to
a standard layout, which could simplify the process of making changes for one
source or adding new sources with different attributes later. However, it usually
also means a very large number of landing tables, which can be tedious and
cumbersome to set up.
• Other implementations have used one landing table per target table, which means
that the ETL needs to transform all sources for the same target to the same
standard layout. This approach does allow the ETL to be standardized, making it
much faster to develop and test for the first implementation (where typically a large
number of sources need to be coded). It is possible that this approach also makes
it more costly to maintain after initial deployment, because changes from one
source could potentially affect multiple ETL mappings.
If you use one landing table per target table in your Siperian Hub implementation,
then the landing table needs to include a source identifier, which must be used in
filtering the data mapped to each staging table. The landing table should also have
a range partition specified in Oracle to partition it according to source system,
which allows partitions to be truncated before the ETL inserts data from a source,
rather than having records deleted from the landing table.


Staging Tables

Staging tables must be based on the columns provided by the source system for the
target base object or dependent object for which the staging table is defined, even if the
landing tables are shared across multiple source systems. If you do not make the
columns on staging tables source-specific, then you create unnecessary trust and
validation requirements.

Trust is a powerful mechanism, but it carries performance overhead. Use trust where it
is appropriate and necessary, but not where the most recent cell value will suffice for
the surviving record. For more information, see “Using Trust Levels” on page 52.

If you limit the columns in the staging tables to the columns actually provided by the
source systems, then you can restrict the trust columns to those that come from two or
more staging tables. Use this approach instead of treating every column as if it comes
from every source, which would mean needing to add trust for every column, and then
validation rules to downgrade the trust on null values for all of the sources that do not
provide values for the columns.
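As a sketch of this guideline, the set of columns that actually need trust can be derived by counting how many staging tables supply each column; the staging-table and column names here are hypothetical.

```python
# Only columns supplied by two or more source staging tables need trust
# settings; single-source columns can simply take the most recent value.
from collections import Counter

staging_columns = {
    "CRM_STG":     {"first_name", "last_name", "phone"},
    "BILLING_STG": {"last_name", "phone", "credit_limit"},
}

counts = Counter(col for cols in staging_columns.values() for col in cols)
trust_columns = sorted(col for col, n in counts.items() if n >= 2)
print(trust_columns)  # → ['last_name', 'phone']
```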

More trust columns and validation rules obviously affect the load and the merge
processes. Also, the more trusted columns there are, the longer the update statements
for the control table will be. Bear in mind that Oracle and DB2 have a 32K limit on
the size of the SQL buffer for SQL statements. For this reason, more than 40 trust
columns result in a horizontal split in the update of the control table: MRM updates
only 40 columns at a time.
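The 40-column split can be illustrated with a sketch that breaks one wide control-table update into batches; the table and column names are hypothetical, and the SQL text is only indicative, not MRM's actual statements.

```python
# Split one wide UPDATE into several statements, batch_size columns at a
# time, so each statement stays within the SQL buffer limit.

def chunked_updates(table, columns, batch_size=40):
    statements = []
    for i in range(0, len(columns), batch_size):
        batch = columns[i:i + batch_size]
        sets = ", ".join(f"{c} = ?" for c in batch)
        statements.append(f"UPDATE {table} SET {sets} WHERE rowid_object = ?")
    return statements

# 90 trusted columns → three UPDATE statements (40 + 40 + 10 columns)
stmts = chunked_updates("C_CUSTOMER_CTL", [f"col_{n}" for n in range(90)])
print(len(stmts))  # → 3
```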


Design Patterns
This section summarizes the following typical physical data model design scenarios and
describes options for implementing them:
• Households
• Addresses
• Populating the Address Household Object
• Communication Channel Models

Households
A Household is a grouping of customer records according to geographic location.
For example, all of the people living at one address could be considered a household,
or a group of doctors practicing at one hospital could be considered a household.

Create Household as a base object that is the parent of Customer. The easiest type of
household is one in which the household has no attributes of its own. It uses
inter-table match to match on selected Customer match columns that usually include
the Address match columns.
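The grouping can be sketched as follows; real household matching uses fuzzy match rules on the Customer and Address match columns, so the exact-key grouping and sample data here are only illustrative.

```python
# Customers whose address columns match collapse into the same Household
# parent record (sample data is made up).

customers = [
    {"id": 1, "name": "N.E. One",  "addr": "1 MAIN ST 94105"},
    {"id": 2, "name": "Ann Other", "addr": "1 MAIN ST 94105"},
    {"id": 3, "name": "Sam Body",  "addr": "9 OAK AVE 10001"},
]

households = {}
for c in customers:
    households.setdefault(c["addr"], []).append(c["id"])

print(len(households))                # → 2 household records
print(households["1 MAIN ST 94105"])  # → [1, 2]
```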


The following figure shows an example of a logical model for Households.

Addresses
The ideal model for addresses involves a one-to-many relationship from Business Party
to Address, with Address match rules that include Business Party ID to prevent
matches across different Business Parties. However, there are occasionally business
cases for consolidating addresses across Business Parties, such as to get a single
identifying key for all addresses for the same location, regardless of which Business
Parties use that address. If there are business reasons for consolidating Addresses
regardless of the Business Parties using the Addresses, then the following consolidated
address model is recommended.

In this model, the Business Party Address base object consolidates the Addresses per
Business Party. The Business_Party_ROWID is one of the match criteria for the
Business Party Address base object, and Business Party Addresses should merge only if
they have the same Business_Party_ROWID value.

The Business Party Address base object gives you the distinct set of addresses for each
business party, but it does not give you a distinct set of all the addresses with a unique ID
for each unique address. To get a unique set of Address identifiers, the Address base
object would need to be included in the data model.

At its simplest, the Address base object does not include any attributes of its own,
other than a Status Indicator to indicate whether the Address ID is active or inactive.
Instead, it uses intertable match to match using the attributes from the Business Party
Address table. This approach assumes that tight matching rules are used for the
Address base object, and that survivorship of household-specific attributes is not
required. If household-specific attributes need to be survived, then those attributes
must be defined and populated for the Address base object, along with the appropriate
Trust rules.


Populating the Address Household Object


The Address household object is a standard base object that is populated through
landing and staging tables. At the cross-reference level, there is one-to-one cardinality
between the Address base object (Address cross-reference) and the Business Party
Address base object (Business Party Address cross-reference).

Landing Tables

The Address object should share landing tables with the Business Party Address base
object. The Address base object uses the same pkey_src_object values as the
Business Party Address.

Staging Tables

The Address object must have its own staging tables. As for any other base object, the
Address base object requires a separate staging table for each source system that can
populate it. Each Address staging table usually only has pkey_src_object and
last_update_date columns, unless there are other, household-specific attributes to be
included.

If hard delete detection is being used to deactivate unused address identifiers, then the
staging tables may also include a status_ind column.

Populating Address and Populating Business Party Address

When you first load Address and Business Party Address (before doing any matching
and merging on either), there will be the same number of records in Address and in
Business Party Address. The cross-reference tables will also have the same number of
records. Each Business Party Address record has a foreign key referencing a record in
the Address table.

As you merge records in the Address table, records in the Business Party Address table
will be updated to reflect the surviving Address key. Once all Address records have
been merged, you can merge records in the Business Party Address table.


Each new Business Party Address record will have a corresponding new Address
record that will need to be matched against the existing Address records.
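The load-then-merge sequence above can be sketched as follows; this is an illustrative simulation with made-up keys, not MRM's actual cascade logic.

```python
# After the initial 1:1 load, merging two Address rows re-points the
# corresponding Business Party Address foreign keys to the surviving key.

address = {10: "1 MAIN ST", 11: "1 MAIN ST", 12: "9 OAK AVE"}  # rowid → value
bp_address = [{"bp": "A", "addr_fk": 10},
              {"bp": "B", "addr_fk": 11},
              {"bp": "C", "addr_fk": 12}]

def merge_address(survivor, victim):
    del address[victim]
    for row in bp_address:                 # cascade the surviving key down
        if row["addr_fk"] == victim:
            row["addr_fk"] = survivor

merge_address(10, 11)                      # the two "1 MAIN ST" rows merge
print(sorted(address))                     # → [10, 12]
print([r["addr_fk"] for r in bp_address])  # → [10, 10, 12]
```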

Communication Channel Models


Communication channel data refers to electronic and telephonic address information,
such as phone numbers, fax numbers, email addresses, URLs, pagers, and so on.

Two Communication Channel Models

This section describes two communication channel models.

Generic, Normalized Communication Channel Model

In some data models, communication channel data is modeled as a completely generic
structure, with a type identifier that determines the type of electronic or telephonic
address contained in each row of data, as shown in the following figure.


Denormalized, Type-Specific Communication Channel Model

In other data models, communication channel data is stored in a denormalized,
type-specific structure in which each communication channel is stored in its own
specific column, as shown in the following figure.

Comparison Between the Models

This section compares the pros and cons of each of these models.

Generic, Highly-Normalized Communication Channel Model

The advantages and disadvantages of a generic, highly-normalized communication
channel model are as follows.

Pros:
• No restriction on the number of phone numbers, e-mail addresses, and so on that
can be stored for a customer.
• Duplicate values can easily be identified and consolidated.
• New communication channel types can easily be added without needing to change
anything in the data model.

Cons:
• More problematic to use in Customer match rules. Cannot do combinations of
matches like "WHERE customer phone numbers match AND customer e-mails match".
• More difficult to determine new versus updated values, because there is no primary
key value for each record that does not include the communication channel value.
• Does not support a granular level of detail for types. For example, all elements of
a phone number are usually stored in one string, rather than being separated into
country code, dial code, phone number, and extension.
• Requires ETL work to normalize the data, or otherwise requires multiple loads in
MRM through multiple staging tables.

Denormalized, Type-Specific Communication Channel Model

The advantages and disadvantages of a denormalized, type-specific communication
channel model are as follows.

Pros:
• Can do match combinations like "WHERE customer phone numbers match AND
customer e-mails match".
• Easier to determine new versus updated values. An update to a value can be treated
as a direct update, because the record is keyed on customer_id and the update is to
a specific column.
• Supports a more granular level of detail for types. For example, instead of storing
all elements of a phone number in one string, they can be separated into country
code, dial code, phone number, and extension.
• Straight mapping from landing to staging can be supported for the relevant
communication channels, without the need for normalizing the data in ETL or
through multiple stage jobs in MRM.

Cons:
• The number of communication channels that can be stored per customer is limited
by the number of columns available.
• Can be more difficult to identify and consolidate duplicate values. For example, if
Source A provides phone_1 as 555-123-4576 but Source B has that value as
phone_2 and not phone_1, then the two values will not be de-duped.
• New communication channel types require changes to the data model.


Proposed Third Type of Data Model for Communication Channels

This model, shown in the following figure, is a hybrid of the other two models.
It provides normalized structures for communication channels without the high degree
of generalization described in "Generic, Normalized Communication Channel Model"
on page 46.

In this hybrid model, separate communication channel objects have been created for
logically similar communication types. This approach minimizes the drawbacks of the
two previous models and provides the best benefits of both.



Chapter 4: Using Trust Settings and Validation Rules

Trust is a designation of confidence in the relative accuracy of a particular piece of
data. Validation determines whether a particular piece of data is valid. Trust and
validation work together to determine "the best version of the truth" among multiple
sources of data. This chapter provides a brief overview of how trust settings and
validation rules work together, best practice recommendations, and examples.
This chapter is recommended for administrators and implementers.

Chapter Contents
• Using Trust Levels
• Using Validation
• Using Trust and Validation Together

Using Trust Levels


This section describes how to determine appropriate trust levels for an individual piece
of data coming from a given data source. You use the Trust tool in the Hub Console to
configure trust levels. To learn more, see the Siperian Hub Administrator’s Guide.

About Trust Levels


Trust is a designation of confidence in the relative accuracy of a particular piece of
data. For each column from each source, you can define a trust level represented by a
number between 0 and 100, with zero being the least trustworthy and 100 being the
most trustworthy. By itself, this number has no meaning. It becomes meaningful only
when compared with another trust number to determine which is higher.

Trust is used to determine:


• Survivorship when two or more records are merged (in case of a group merge).
• Whether updates from a source system are reliable enough to update the “best
version of the truth” record.
MRM’s on-going management of the “best of breed” record is achieved using the
trust rules to assess updates from source systems in terms of their trust weightings.

How Trust Works


In a merge, Siperian Hub calculates a trust score for both records being merged
together (merge source and merge target). Siperian Hub compares the trust score of the
merge source with the trust score of the merge target and changes the survived value in
the base object only if the merge source has a higher trust score than the merge target.
If the trust score of the merge target is higher, then the value of the merge target
remains unchanged.

Consider the following example. When two base object records merge, MRM calculates
the trust score for each trusted column in the two base object records being merged.


Cells (the intersection of a column and record) with the highest values survive in the
final merged record.
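The comparison can be sketched as follows; trust scores are supplied directly here for illustration (in practice MRM calculates them from the trust settings and decay curves), and the column names are made up.

```python
# For each trusted column, the cell with the higher trust score survives in
# the merged record; if the merge target's score is the same or higher, its
# value remains unchanged.

def survive(target, source):
    """target/source: {column: (value, trust_score)} → merged cell values."""
    merged = {}
    for col in target:
        t_val, t_trust = target[col]
        s_val, s_trust = source[col]
        merged[col] = s_val if s_trust > t_trust else t_val
    return merged

target = {"phone": ("555-0100", 80), "email": ("old@example.com", 40)}
source = {"phone": ("555-0199", 60), "email": ("new@example.com", 70)}
print(survive(target, source))
# → {'phone': '555-0100', 'email': 'new@example.com'}
```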

Calculations

When an update comes in from a source system, MRM calculates the trust score on the
incoming data and compares it to the trust score of the data in the base object.


Updates are applied to the base object only for cells that have the same or higher trust
score on the incoming data.

How Decay Periods Affect Trust Levels

Depending on the configured decay period, a small difference (such as one day) in the
age of two records does not affect survivorship immediately, especially if the
merge date is very close to the src_lud (not much time has passed for the trust level
to move down the curve). With linear decay, the impact of age remains constant. With
RISL and SIRL decay, the impact of age changes as the trust level moves down the curve.

However, the way you specify the time units does affect trust levels. The more
granularly you specify the time units, the more sensitive the graph is to small changes
in age, although that sensitivity does decrease with longer decay periods. For example,
the following table shows trust settings based on different ways to configure the decay
period. For all of these examples, the maximum trust setting is 90 and the minimum
trust setting is 10.
How Time Settings for Decay Period Affect Trust Levels

Decay Period   Graph Type   Trust Level
One year       RISL         trust = 90
12 months      RISL         trust = 89.60
365 days       RISL         trust = 89.56
1 year         linear       trust = 90
12 months      linear       trust = 89.8
365 days       linear       trust = 89.78
3 years        linear       trust = 90 after one day
36 months      linear       trust = 89.93 (actually 89.9333; the system rounds to two decimal places)
1095 days      linear       trust = 89.93 (actually 89.9269)
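The days-based linear rows in the table are consistent with a straight-line interpolation between the maximum and minimum trust settings. The following Python sketch is an illustrative approximation (not the product's actual algorithm) that reproduces the 365-day and 1095-day linear values:

```python
def linear_decayed_trust(max_trust, min_trust, elapsed_units, decay_units):
    """Straight-line decay from max_trust down to min_trust over the
    configured decay period (illustrative approximation)."""
    if elapsed_units >= decay_units:
        return min_trust
    return max_trust - (max_trust - min_trust) * elapsed_units / decay_units

# One day into the decay period, maximum trust 90, minimum trust 10:
print(round(linear_decayed_trust(90, 10, 1, 365), 2))    # 89.78 (365-day row)
print(round(linear_decayed_trust(90, 10, 1, 1095), 2))   # 89.93 (1095-day row)
```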

Ranking Source Systems According to Trustworthiness


Before you define trust settings, you must analyze the data source systems and rank
them in descending order of reliability. The goal is to define the relative (not absolute)
level of reliability of data in these source systems. Ranking is by attribute. For each
attribute, the ranking of source systems might differ. Levels need not be
exclusive—you can have more than one system rated at the same level.

Consider doing this process on a whiteboard. List the attributes and assign a ranking
either to each attribute individually or to a group of related attributes (such as
address data).

When ranking the reliability of source systems, consider the following issues:
• What are the processes for updating the source data? For example, if the source
system has three screens for updating all of the data, then data on the first, most
frequently-used screen is likely to be updated more frequently than values on any
subsequent screens.
• What information goes into each source system? How is data validated? What is
the process for updating data? Do the attributes that you want to bring into
Siperian Hub exist in the source system (if so, then you might encounter a lot more
unwanted or incorrect data)? How clean is the data in the source system and how
clean can the data be made by removing junk data? It is important to understand
what the source systems and your ETL process are doing to cleanse data in the
source system.
• Look at systems that are highly rated. Are there conditions that you define as part
of the data analysis that result from the most reliable source? Note those
conditions as part of your analysis.
• Focus on one base object at a time. Within the base object, focus on each trusted
attribute. Rank the source systems for that attribute according to their relative
trustworthiness.
• Ask on-site business experts and/or data specialists to provide practical knowledge
about the data sources so that you can more effectively define the trust rankings.
Consider conducting one or more trust workshops with these experts to help
clarify the trust rankings. Make sure that you document any decisions, particularly
trade-off decisions, and obtain sign-off approval from the participants.
• Analyze data for invalid conditions. Conduct a frequency analysis to determine
how often such conditions occur per source. The goal is to identify which data is
more correct, not just which data is more correctly formatted.
Note: Be sure to distinguish between invalid data conditions that can be remedied
through data cleansing and those that cannot. Consider focusing on trust and
validation rules for conditions that cannot be remedied.
• Determine which columns require trust settings and which do not. You should
define trust on a column if any of the following conditions apply:
• There are two or more data sources for that column and they are not equally
reliable (or equally unreliable).
• The Last Update Date must be taken into account in determining
survivorship.


• A data steward must be able to select or promote the surviving trust value in
the Merge Manager / Data Manager (to learn more, see the Siperian Hub User’s
Guide).
• Consider the performance impact of configuring trust columns.
• The larger the number of configured trust columns and validation rules, the
longer it takes to complete the load and merge processes.
• The larger the number of trusted columns, the longer it takes to complete the
update statements for the control table. Oracle has a 32K limit on the size of
the SQL buffer for SQL statements. For this reason, more than 60 trust
columns result in a horizontal split in the update of the control table
(consequently, MRM will try to update only 60 columns at a time).
• Identify logical trust groups in the data and assign them all the same trust levels, as
well as validation downgrades.
• For example, address fields should all belong to the same logical trust group so
that all parts of the address are always taken from the same source record. This
is because the granular components of an address are dependent on each other
for their meaning. Nonsensical addresses could result if parts of the address
were taken from one source and other parts of the address were taken from a
different source.
Note: The logical trust group for address should include a validation status
indicator if it is being used to determine a downgrade percentage in a
validation rule.
• Names (First Name, Middle Name and Last Name) usually do not belong to a
logical trust group. This is because components of a full name are not
dependent on each other for their meaning. A source system that provides, for
example, good information on last names might provide only an initial letter
for middle name, while another source system that provides lower-quality last
names might provide full and valid middle names.
• Siperian Hub handles delete flags in two different ways:
• direct delete—Delete-flagging any cross-reference for a base object will result
in the base object record being flagged for delete as well.
• consensus delete—A base object record is flagged as fully inactive only if all
of its cross-reference records are flagged as deleted. In this model, the base
object records that have some but not all cross-references flagged as deleted
are flagged as partially deleted.
For delete flags for consensus trust:
• "P" (Partial Delete) must have a lower trust than "I" (Inactive) or "A"
(Active).
• "I" and "A" should be at the same level.
Add a validation rule to downgrade the trust score if the delete flag is "P" or
"I". Do not preserve minimum trust. For more information, see the Hard
Delete Detection technical bulletin.
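The consensus model described above can be sketched with a small helper. This Python sketch is illustrative only; the 'A'/'P'/'I' flag values follow the convention described above:

```python
def consensus_status(xref_deleted_flags):
    """Consensus delete: a base object is fully inactive ('I') only when
    every cross-reference is flagged as deleted, partially deleted ('P')
    when some but not all are, and active ('A') otherwise."""
    if all(xref_deleted_flags):
        return "I"
    if any(xref_deleted_flags):
        return "P"
    return "A"

assert consensus_status([True, True]) == "I"
assert consensus_status([True, False]) == "P"
assert consensus_status([False, False]) == "A"
```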

Trust Best Practices


Trust values are run-time calculations. Trust is planned in the Discover and Design
phases and verified and fine-tuned in the Build phase. To learn more, see “Phases in an
Siperian Hub Implementation Project” on page 6.

Choosing the correct trust levels is a complex process. It is not enough to consider one
system in isolation. You must ensure that the trust settings for all of the source systems
that contribute to a particular column combine to produce the behavior that you want.
1. During the Discover phase, talk with as many people as possible about the data.

2. Use the Data Quality Audit questionnaire in the Analysis phase. Question the
system owners, including maintenance, data steward, and sales liaison
representatives.
• For each table/file, determine the table/file name, the total number of records
in the inspected set, and the total number of records in the full data
population.
• For each column in the table, determine the column name, number of distinct
values, number of NULL values, percentage of NULL values, text length
(maximum, minimum, and average), types of non-alphanumeric characters
found, number of values that indicate “unknown” or “undefined”, the top ten
values (the ten values that occur most frequently), and any other notes
regarding your visual inspection of the data.


3. Use the Trust Matrix to record all relevant information that goes toward
determining trust settings for each source system. The Trust Matrix asks a number
of questions about the source data. Each question is designed to elicit information
about the probable reliability of the source system. Here are some of the questions
that you should consider:
• Does the source system validate this data value? How reliably does it do this?
• How important is this data value to the users of the source system, as
compared with other data values? Users are likely to put the most effort into
validating the data that is central to their work.
• How frequently is the source system updated?
• How frequently is a particular attribute likely to be updated?
4. Rank the systems in relation to the source system of highest trust based on the
attributes that will be used.
5. For each column in each base object table, you can enable or disable trust using the
Trust tool in the Hub Console.
• If trust is disabled, Siperian Hub will always use the most recently loaded
value, regardless of which source system it comes from.
• For most columns that come from multiple source systems, you will want to
enable trust because some systems are more reliable sources of that particular
information than others. If you enable the trust for a column, you also specify
the trust settings for each source system that could update the column.
6. If you expect a data steward to override settings of sources, enable trust and use a
special source system called “Admin” that represents manual updates that the data
steward makes within Siperian Hub. This source system can contribute data to any
column that has trust enabled using the Trust tool. You must specify trust
settings for the Siperian Admin system. You will probably want to set the trust
settings for this system to high values to ensure that your manual updates override
any existing values from source systems.
7. Trust and validation can cause situations in which values survive in the base object
even though they are no longer in any of the cross-references. Validation
downgrade can mean that the source does not update a cell even if it had
previously provided the cell value. The survived value in a base object might not
have the same value as the corresponding cross-reference. There might not be any
cross-references with the same value or trust as the base object. This situation
causes problems in the following areas:
• Delete indicators – making sure the right value is in the base object
• Removing the influence of inactivated records from base object

Configuring Trust Levels


This section covers issues associated with configuring trust levels. You use the Trust
tool in the Hub Console to configure trust levels. To learn more, see the Siperian Hub
Administrator’s Guide.

Guidelines for Configuring Trust Settings

Consider the following guidelines for configuring trust settings:


• If a column receives data from multiple sources, then enable trust for that column.
This also requires you to specify the relative trust level for each of the source
systems that update the column.
• If you do a lot of cleansing on data from one source and not another, reduce trust
for the data source that requires more cleansing, after receiving the appropriate
approvals from the business.
• If you set a long decay period for data, you might have difficulty picking up small
fluctuations in the trust level. You must balance this consequence against your
reasons for setting a long decay period.
• Some groups of data form logical trust groups. For example, the components of an
address form a logical trust group. All the elements of an address must have the
same settings: trust codes, decay values, and so on. You do not want to pick up
pieces of an address from different sources. Also, if a postal service database
returns an indicator that some part of the address data is invalid, then grouping the
data means that all parts of the address will be downgraded the same amount.
• With staging tables, if you have logical trust groups, enable the Allow Null Update
flag for the members of that group. For example, suppose an Address Line 2
column contains the value Suite 2 and then a user corrects the record by
removing the Suite 2 value. If Allow Null Update is not enabled for that column,
then the Suite 2 value would remain in the cell, resulting in an inaccurate record.


• Avoid assigning numbers that are too close together. Make sure that you set trust
levels far enough apart (a minimum difference of five; ten is better) to avoid
rounding problems that might occur during trust calculations. In the course of
calculating trust as it degrades, Siperian Hub rounds these numbers and, if the
numbers are too close together, rounding errors can obscure the differences.

Defining Trust Settings

When defining trust settings, you must:


• Determine the ranking of the attributes (or groups of attributes). See “Ranking
Source Systems According to Trustworthiness” on page 55 for more information.
• Assign trust values based on these rankings. See “Guidelines for Configuring Trust
Settings” on page 60 for more information.
• Assign decay values based on the analysis of the continuing reliability of the data.
See “Enabling Cell Update” on page 62 for more information.

The following example shows ranking source systems for customer name.

Source   Title   First   Middle   Last   Suffix
Sys1     80      90      60       90     90
Sys2     60      75      90       80     60
Sys3     95      80      80       60     80

To define trust settings:


1. Review your data source analysis and notice the criteria you noted that
distinguished the highly-rated systems. The criteria that result in the most reliable
sources become the validation rules (see “Using Validation” on page 65). Using
these criteria, you can make sure that data from sources that conform to those
rules prevail over less reliable data sources.
2. Quantify these rules by applying a numerical designation of trust to those source
systems using a scale of 0 (lowest trust) to 100 (highest trust). Remember that
these numbers have no meaning in themselves—they are meaningful only in the
relative ranking of the source systems in relation to each other.
3. Once you have identified the validation rules, define the decay type and rate.


The most common decay type is SIRL (Slow Initial, Rapid Later). This decay type
makes the most sense for most data.
Another common scenario is where data that comes from the source system (in
the form of updates) must always prevail over the existing data. In this case,
consider disabling trust. This will guarantee that the newest incoming data from
the source system will overwrite the data already in the MRM.
4. Define maximum / minimum trust settings and decay curves.
To do so, identify the cross-over points where decay curves would intersect each
other. Leave a buffer at the top and bottom of the ranges (avoid setting the
maximum trust to 100 or the minimum trust to 0). Leave a buffer between source
systems as well. This buffer makes it easier to tweak trust settings and to add more
sources later. A suggested gap between settings is at least five or more, preferably
10.

Enabling Cell Update

By default, when Siperian Hub receives an updated value for a column on a record
from a source system, it recalculates the trust values for all trusted columns for that
source from maximum trust again, based on the last update date of the record.
Because Siperian Hub does not check whether the actual cell values have changed,
an update to one column is treated as confirming the values in the other columns.
This restarts the decay curve for all the values for the record from the beginning. If you
want Siperian Hub to check whether the actual column value has changed before
updating the column and recalculating its trust level from the Maximum Trust, then
enable cell update using the Schema Manager in the Hub Console (to learn more, see
the Siperian Hub Administrator’s Guide).

Enable cell update on your staging table if parts of the record come from source
systems that are regularly updated, while other parts of the record are not.
Generally, users rarely look at the parts that are not regularly updated. Enabling cell
update lets those parts of the record continue to decay, while the updated cells have
their trust values reset appropriately.

For example, suppose a source system has three screens for updating all the data.
Anything that is not on the first, most frequently-used screen is probably updated
much less frequently. In this case, enabling cell update allows the trust value for these
infrequently updated cells to continue to decay.
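The behavior described above can be sketched as follows. This is a simplified Python illustration under assumed semantics, not the actual load logic:

```python
def load_trusted_cell(cell, incoming_value, cell_update_enabled, max_trust):
    """Without cell update, any update to the record resets the column's
    trust to maximum (restarting decay); with cell update enabled, trust
    is reset only when the incoming value actually differs."""
    if not cell_update_enabled or incoming_value != cell["value"]:
        return {"value": incoming_value, "trust": max_trust}
    return cell  # unchanged value: trust continues to decay

stale = {"value": "Suite 2", "trust": 55}   # hypothetical decayed cell
assert load_trusted_cell(stale, "Suite 2", True, 90) == stale
assert load_trusted_cell(stale, "Suite 2", False, 90)["trust"] == 90
```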

Example Stored Procedure to Calculate Decayed Trust


The following example code calls the CMX.CALC_DECAYED_TRUST stored function to
calculate decayed trust. Use this code if you want to get the calculated trust value
for a particular cell.

DECLARE
   RetVal                 NUMBER;
   IN_PREV_UPDATE_DATE    VARCHAR2(200);
   IN_LATEST_UPDATE_DATE  VARCHAR2(200);
   IN_TRUST               NUMBER;
   IN_MIN_TRUST           NUMBER;
   IN_TIME_UNITS          VARCHAR2(200);
   IN_GRAPH_TYPE          NUMBER;
   IN_X_MAX               NUMBER;
BEGIN
   IN_PREV_UPDATE_DATE   := '11 OCTOBER 2005';
   IN_LATEST_UPDATE_DATE := '12 OCTOBER 2005';
   IN_TRUST              := 90;
   IN_MIN_TRUST          := 10;
   IN_TIME_UNITS         := 'YYYY';  -- 'YYYY' or 'M' or 'D'
   IN_GRAPH_TYPE         := 2;
   IN_X_MAX              := 3;

   RetVal := CMX.CALC_DECAYED_TRUST(IN_PREV_UPDATE_DATE, IN_LATEST_UPDATE_DATE,
                                    IN_TRUST, IN_MIN_TRUST, IN_TIME_UNITS,
                                    IN_GRAPH_TYPE, IN_X_MAX);

   DBMS_OUTPUT.Put_Line('RetVal = ' || TO_CHAR(RetVal));

   COMMIT;
END;

In this example:

Name Description
IN_PREV_UPDATE_DATE Previous update date.
IN_LATEST_UPDATE_DATE Date to calculate the decayed trust score.

IN_TRUST Maximum trust level.
IN_MIN_TRUST Minimum trust level.
IN_TIME_UNITS Time unit used for decay period.
IN_GRAPH_TYPE Decay period type. One of the following types:
• 0 – slow initial, rapid later
• 1 – rapid initial, slow later
• 2 – linear
IN_X_MAX Number of decay units.

If you add SET SERVEROUTPUT ON at the beginning of the script, then when you execute
the script using SQL*Plus, the trust score on the specified date is calculated and
printed out.


Using Validation
This section describes how to use validation rules to determine the validity of an
individual piece of data coming from a given data source. You use the Schema Manager
in the Hub Console to configure validation rules. To learn more, see the Siperian Hub
Administrator’s Guide.

About Validation Rules


A validation rule tells MRM the condition under which a data value is not valid. If data
meets the criterion specified by the validation rule, then the trust value for that data is
downgraded by the percentage specified in the validation rule. If the Reserve Minimum
Trust flag is set for the column, then the trust score cannot be downgraded below the
column’s minimum trust setting.

How Validation Works


If you set validation rules with trust settings, cells that meet the condition defined in
the validation rule have their trust scores downgraded by the percentage downgrade
value specified for the validation rule according to the following algorithm.
Final trust = Trust - (Trust * Validation_Downgrade / 100)

For example, with a validation downgrade percentage of 50%, and a trust level
calculated at 60:
Final Trust Score = 60 - (60 * 50 / 100)

Therefore:
Final Trust Score = 60 - 30 = 30
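The calculation above can be expressed directly in code. A minimal Python sketch of the formula:

```python
def downgraded_trust(trust, downgrade_pct):
    """Final trust = Trust - (Trust * Validation_Downgrade / 100)."""
    return trust - trust * downgrade_pct / 100

assert downgraded_trust(60, 50) == 30  # the worked example above
```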

Validation rules are evaluated in sequence, and the last validation rule that is met
provides the validation downgrade that is applied. The order of the validation rules is
therefore important. For example, the following two validation rule lists have different
results for the same input data.


Sequence 1
1. 'Downgrade trust on First_Name by 50% if Length < 3'
2. 'Downgrade trust on First_Name by 75% if Delete_ind=Y'

Sequence 2
1. 'Downgrade trust on First_Name by 75% if Delete_ind=Y'
2. 'Downgrade trust on First_Name by 50% if Length < 3'

For a given record that is flagged as deleted and where the value in the First_Name
column is 'MK', the final trust score for each of the sequences given above is
calculated as follows:
• Sequence 1: Final Trust Score = (Trust - (Trust * 75 / 100))
If Trust was calculated as 60, then for Sequence 1, Final Trust = (60 - 45) = 15.
• Sequence 2: Final Trust Score = (Trust - (Trust * 50 / 100))
If Trust was calculated as 60, then for Sequence 2, Final Trust = (60 - 30) = 30.

If it is more important that the trust score be downgraded for deleted records than for
records with short first names, then Sequence 1 is the better approach to use.
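The last-applicable-rule behavior described earlier can be simulated to confirm the two outcomes above. This Python sketch uses a hypothetical rule encoding, not product code:

```python
def final_trust(trust, rules, record):
    """Evaluate rules top to bottom; only the LAST applicable rule's
    downgrade percentage is applied (sketch of the described behavior)."""
    applied = None
    for condition, pct in rules:
        if condition(record):
            applied = pct
    return trust if applied is None else trust - trust * applied / 100

record = {"First_Name": "MK", "Delete_ind": "Y"}
sequence_1 = [(lambda r: len(r["First_Name"]) < 3, 50),
              (lambda r: r["Delete_ind"] == "Y", 75)]
sequence_2 = list(reversed(sequence_1))
assert final_trust(60, sequence_1, record) == 15   # Sequence 1
assert final_trust(60, sequence_2, record) == 30   # Sequence 2
```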

Differences Between Batch and Online (API) Processing

Validation rules are evaluated differently for batch and online (API) processing, which
can result in different outcomes for downgrading trust levels for a given column or
piece of data.
• Batch Process—Validation rules are evaluated from top to bottom. The process
steps through all the rules and only the last applicable rule is applied.
• API call (PUT)—Validation rules are applied in the reverse order. The first rule that
meets the validation criteria is the only one applied (evaluation stops at that point).

Ordering and Grouping Validation Rules

The order of the validation rules is critical. Validation rules should be ordered
starting with the rules that have the lowest impact (rules resulting in the least
downgrade), and moving to the rules that have the highest impact (rules resulting in the
highest downgrade). In many cases, downgrades are mixed and matched across rules.
Therefore, the goal is to determine how you order them by level of severity.

Consider the following set of example validation rules.


Rule 1 - Downgrade FName by 20%, downgrade ID by 60% WHEN fieldA = 'BAD'
Rule 2 - Downgrade FName by 40%, downgrade ID by 40%, downgrade FLAG_A by 80% WHEN FLAG_A = 'N'
Rule 3 - Downgrade FName by 10%, downgrade ID by 70% WHEN FLAG_B='N'

In this set of validation rules, note that the downgrade in rule 1 is for two columns,
whereas rule 2 has three columns for downgrade and rule 3 has two columns for
downgrade. If the situation arises in which all three rules are satisfied, then the final
outcome of the downgrade will be based on a combination, such as:
"downgrade FName by 10%, downgrade ID by 70% and downgrade FLAG_A by
80%"

The downgrade process sequentially applies each downgrade rule that meets its
condition and stores the downgraded results in a temp table. In this example, the values
inserted will be for Rule 1, which includes only the FName and ID columns. Rule 2 will
overwrite those values for this rowid_object with FName, ID and FLAG_A. Rule 3 will
then overwrite the same record with values only for columns FName and ID. This
processing is what produces downgrade values that span rules.

If all the downgrade rules are met, then only the values from one downgrade rule per
column (not always the same one) will be applied. Therefore, the downgrade values are
not cumulative.

Group and order the downgrade rules by defining validation rules that affect the
same columns. As a result, you might end up defining multiple rules with the same
WHERE clause, which increases the total number of validation rules. The previous
example would need to be broken down as:
Rule 1 - Downgrade FName by 10% WHEN FLAG_B='N'
Rule 2 - Downgrade FName by 20% WHEN fieldA = 'BAD'
Rule 3 - Downgrade FName by 40% WHEN FLAG_A = 'N'
Rule 4 - Downgrade ID by 40%, downgrade FLAG_A by 80% WHEN FLAG_A = 'N'
Rule 5 - Downgrade ID by 60% WHEN fieldA = 'BAD'
Rule 6 - Downgrade ID by 70% WHEN FLAG_B='N'

Compared to the previous example, if all rules were met, then this would give us a final
result of
"downgrade FName by 40%, downgrade ID by 70% and downgrade FLAG_A by
80%"
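The per-column overwrite behavior described in this section can be simulated as follows. This Python sketch (hypothetical rule encoding, not product code) reproduces both the cross-rule result and the regrouped result:

```python
def apply_downgrades(rules, record):
    """Each applicable rule writes its per-column downgrade percentages
    into a working map, overwriting earlier values column by column
    (downgrades are not cumulative)."""
    result = {}
    for condition, downgrades in rules:
        if condition(record):
            result.update(downgrades)
    return result

record = {"fieldA": "BAD", "FLAG_A": "N", "FLAG_B": "N"}
original_rules = [
    (lambda r: r["fieldA"] == "BAD", {"FName": 20, "ID": 60}),
    (lambda r: r["FLAG_A"] == "N", {"FName": 40, "ID": 40, "FLAG_A": 80}),
    (lambda r: r["FLAG_B"] == "N", {"FName": 10, "ID": 70}),
]
regrouped_rules = [
    (lambda r: r["FLAG_B"] == "N", {"FName": 10}),
    (lambda r: r["fieldA"] == "BAD", {"FName": 20}),
    (lambda r: r["FLAG_A"] == "N", {"FName": 40}),
    (lambda r: r["FLAG_A"] == "N", {"ID": 40, "FLAG_A": 80}),
    (lambda r: r["fieldA"] == "BAD", {"ID": 60}),
    (lambda r: r["FLAG_B"] == "N", {"ID": 70}),
]
assert apply_downgrades(original_rules, record) == {"FName": 10, "ID": 70, "FLAG_A": 80}
assert apply_downgrades(regrouped_rules, record) == {"FName": 40, "ID": 70, "FLAG_A": 80}
```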

Best Practices for Validation Rules


This section describes best practices for validation rules.

Using Cross-Column Validation

Consider how the data comes in, in terms of grouping. Do all columns come in
together from staging tables and PUTs? If not, then cross-column validation rules are
not valid.

Using Complex Validation Rules

You must have foreign keys when using complex validation rules.

Validation and Its Effect on Load and Merge Performance

Validation rules have an impact on the performance of Load and Merge jobs because
they involve running more queries and maintaining more metadata. Therefore, you
should use validation rules judiciously and only where needed. Consider the following
issues:
• Use validation rules for a column only when they are truly required.
• Limit the number of validation rules per column.
• If a Load job is slow, manually create indexes in the database on the staging table
for columns used as criteria.


• Joining to other tables involves a lot of overhead. If a join is necessary, join only
low volume tables. It is better to have that data be part of the ETL process than
the validation process.

Using SQL In Validation Rules

Make sure that any SQL used in a validation rule is well formed and well tuned.
For example:
• If your validation rule contains multiple conditions, enclose the validation rule in
parentheses, especially if the validation rule contains OR conditions. The SQL
fragment you define in a validation rule is appended to an existing SQL fragment
in MRM, and badly formed queries can result in unexpected results and
long-running queries.
• Use the following syntax:
x IN (value1, value2, value3)

instead of the following syntax:
(x = value1 or x = value2 or x = value3)

because it is more efficient for the RDBMS to evaluate a set of values than
multiple OR conditions.


Using Trust and Validation Together


This section describes using trust levels and validation rules together.

Scenarios Involving Trust and Validation for a Column


This section describes the following scenarios:
• Column with No Configured Trust Levels or Validation Rules
• Column Configured With Validation Rules But No Trust Levels
• Column Configured With Trust Levels But No Validation Rules
• Column Configured With Trust Levels and One or More Validation Rules

Column with No Configured Trust Levels or Validation Rules

If a given column has no configured trust settings or validation rules, then the most
recently loaded source value for the cell is always the winner, and the cell will be
updated in the base object. In a merge, the value from the record that MRM deems to
be the merge source will survive after the merge.

Column Configured With Validation Rules But No Trust Levels

If a given column is configured with one or more validation rules but no trust settings,
then the following will occur:
• If the validation rule is specified as a 100% downgrade without the Reserve
Minimum Trust option, then a cell that meets the validation rule condition
(meaning that the data is invalid) will not survive in the base object. If no other
source can provide an update value for the cell in the base object, then the
default value specified for the cell survives in the base object. If no default
is specified, then the surviving value is NULL.
• If the validation rule is specified as something other than a 100% downgrade,
and/or if the rule has the Reserve Minimum Trust option, then the most recently
updated source value for the cell is always the winner.


Column Configured With Trust Levels But No Validation Rules

If a given column is configured with trust settings but no validation rules, then the
decayed trust score is calculated based on the last update date of the source record, and
the trust settings for the column for that system. The winning cell is the one with the
highest trust score after decay.

Column Configured With Trust Levels and One or More Validation Rules

If a given column is configured with trust settings and one or more validation rules,
then the validation downgrade is applied for the most severe rule (defined by the
validation rule sequence in the Hub Console) that fails validation, and then the trust
score for that data is downgraded by that percentage. If the new trust score is below
the minimum trust for the rule, then the minimum trust setting is the final trust score.
Finally, the two cell trust scores are compared and the data in the cell with the highest
trust score is chosen as the winning data that updates the cell.
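The per-cell computation for this scenario can be sketched as follows (illustrative Python, assuming the minimum-trust floor applies as described):

```python
def final_cell_trust(trust, downgrade_pct, min_trust):
    """Downgrade by the most severe failing rule's percentage, then
    floor the result at the column's minimum trust setting."""
    downgraded = trust - trust * downgrade_pct / 100
    return max(downgraded, min_trust)

assert final_cell_trust(60, 50, 10) == 30   # downgrade stays above the minimum
assert final_cell_trust(60, 90, 10) == 10   # floored at the minimum trust setting
```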

What Happens When a Record Is Updated


When a record is updated, the cross-reference records for the data are always updated.
The base object records are updated only by data that have higher trust levels than the
existing data in the target base object. Whenever the Load procedure updates the base
object, it also updates the control and history tables associated with the base object.

Note: Load allows a NULL value to come in only if the initial load base object has the
NULL value or you have allow_null_update enabled.

The cross-reference will always get updated for the source system. The base object will
get updated only if the trust score of the latest update for the cell is higher than the
trust score of the base object cell.


Example Using Trust Levels and Validation Rules Together


This section provides an example of using trust levels and validation rules together for
a column based on the “Scenarios Involving Trust and Validation for a Column” on
page 70.

When merging record A into record B, if no trust or validation settings are configured,
then all of the data from record A will be kept. This is not always desirable when there
are numerous data sources of differing levels of trustworthiness providing potential
values for the consolidated record. To achieve a goal of greater data reliability, trust and
validation must be implemented.

Consider the following data.

              Record A   Record B   Final Output
First_Name:   Mark       Mark       Mark
Middle_Name:  L          Lawrence   L
Last_Name:    Hoare      Hoa        Hoa
isRegistered: N          Y          N

Suppose trust and validation were enabled on all four columns and you created the
following validation rules.
Rule Name: "Downgrade trust on short Middle Name"
Rule Type: Custom
Rule Columns: Middle_Name
Rule SQL: Where Length(S.Middle_Name) < 3
Downgrade Percentage: 80

Rule Name: "Downgrade trust on short Last Name"
Rule Type: Custom
Rule Columns: Last_Name
Rule SQL: Where Length(S.Last_Name) <= 3
Downgrade Percentage: 80

To keep this example simple, assume that the source for each record is the MRM
Admin System and that the Maximum Trust is set to 90 on all columns.


Consider the trust scores after these records are loaded/inserted for the Admin
System.

Field Description
Last_Name For Record A, the trust score for Last_Name is 90 because the value
"Hoare" does not result in a trust downgrade. For Record B, the trust score
for Last_Name is (90 - (90 * 80/100)) = 18 after the validation downgrade.
Middle_Name For Record B, the trust score for Middle_Name is 90 because the value
"Lawrence" does not result in a trust downgrade. For Record A, the trust
score for Middle_Name is (90 - (90 * 80/100)) = 18 after the validation
downgrade.

The following results from merging the two records with these settings:
First_Name: Mark
Middle_Name: Lawrence
Last_Name: Hoare
isRegistered: N

Notice that the prevailing value for Middle Name was selected from the record with
the highest final trust score for that cell (Record B). The winning value for Last Name
was selected from the record with the highest final trust score for that cell (Record A).

Because validation rules were not defined for the First_Name or isRegistered columns,
the surviving values were picked from the most recently updated source record.
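The downgrade arithmetic and cell-level survivorship in this example can be sketched as follows. This is an illustrative sketch only, not the actual Siperian Hub engine; the rule predicates mirror the validation rule SQL shown above.

```python
# Sketch of the trust downgrade and cell survivorship described above.
# Illustrative only; not the actual Siperian Hub implementation.

MAX_TRUST = 90
DOWNGRADE_PCT = 80

def downgraded(trust, pct):
    """Apply a validation downgrade: trust - (trust * pct / 100)."""
    return trust - (trust * pct / 100)

def trust_for(column, value):
    """Return a cell's trust score after the example's validation rules."""
    if column == "Middle_Name" and len(value) < 3:
        return downgraded(MAX_TRUST, DOWNGRADE_PCT)  # "Downgrade trust on short Middle Name"
    if column == "Last_Name" and len(value) <= 3:
        return downgraded(MAX_TRUST, DOWNGRADE_PCT)  # "Downgrade trust on short Last Name"
    return MAX_TRUST

record_a = {"Middle_Name": "L", "Last_Name": "Hoare"}
record_b = {"Middle_Name": "Lawrence", "Last_Name": "Hoa"}

# For each cell, the value with the higher final trust score survives the merge.
for column in record_a:
    a_trust = trust_for(column, record_a[column])
    b_trust = trust_for(column, record_b[column])
    winner = record_a[column] if a_trust >= b_trust else record_b[column]
    print(f"{column}: {winner} (A={a_trust}, B={b_trust})")
```

Running this reproduces the merged result above: “Lawrence” wins Middle_Name (trust 90 beats the downgraded 18) and “Hoare” wins Last_Name.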

Using Trust Settings and Validation Rules 73



Chapter 5: Configuring and Tuning Match Rules

This chapter provides information on how to use and tune match rules. It is
recommended for all implementers as well as Siperian Hub administrators. This section
assumes you are familiar with the material in “Match and Merge Setup” in the Siperian
Hub Administrator’s Guide.

Chapter Contents
• About Matching
• Tokens for Match Keys
• Search Strategies
• Match Purposes
• Defining and Testing Your Match Rules
• Matching Best Practices
• Exact Match Column Properties
• Setting Match Batch Sizes
• Using Dynamic Match Analysis Threshold
• Tuning Match for Performance
• About Merging


About Matching
Matching is how Siperian Hub identifies data duplicates.

Before You Start Defining Your Match Rules


Before you begin the process of defining and refining your match rules, it is essential
that you are familiar with your data. You must know:
• how complete the data is. Are your base object records sparsely populated, with
many fields that are NULL?
• how clean the data is. Are you reasonably confident of the quality of the data? Is
the data that is there relatively accurate? Are there a lot of word and character
transpositions?
• what proportion of the data is likely to be duplicates. Data that has many
duplicates is referred to as matchy.
• in the columns you expect to use for matching, what is the expected variation in
the values? This expected variation is called the cardinality.
• which data is suitable for exact matching, and which is better for fuzzy matching.
Fuzzy matching takes into account variations such as word order. Exact matching
doesn’t take into account any variations, but it does have performance advantages.

Steps in the Match Process


The match process consists of the following steps:
1. Generate tokens that encode the data for searching for possible match candidates.
To learn more, see “Tokens for Match Keys” on page 77.
2. Search the data for possible match candidates.
3. Apply the match rules to the search results to return matches.
Siperian Hub uses the parameters you set for the match to generate a score that
describes the degree to which rows match. You can select a range that defines what
constitutes a match. If the score is within the range that you select, then those rows
are returned as matches. You select the range by setting the match level. To learn
more, see “Match Levels” in the Siperian Hub Administrator’s Guide.
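The three steps above can be sketched end to end. This is a toy illustration, not the real engine: the token function here is a hypothetical stand-in for Siperian Hub's match key generation, and the match rule is deliberately trivial.

```python
# Illustrative sketch of the match process steps described above:
# tokenize, search for candidates, then apply match rules.
# Not the real engine; tokens() is a hypothetical stand-in.

def tokens(name):
    """Hypothetical key generation: one fixed-length key per word."""
    return {word[:4].upper().ljust(4, "$") for word in name.split()}

records = {1: "BETH O'BRIEN", 2: "LIZ O'BRIEN", 3: "JOHN SMITH"}

# Step 1: generate tokens for every record (an inverted index: key -> rows).
index = {}
for row_id, name in records.items():
    for key in tokens(name):
        index.setdefault(key, set()).add(row_id)

# Step 2: search -- candidates are records sharing at least one key.
def candidates(row_id):
    return {other
            for key in tokens(records[row_id])
            for other in index[key]} - {row_id}

# Step 3: apply a match rule (here, a trivial "same surname" rule).
def is_match(a, b):
    return records[a].split()[-1] == records[b].split()[-1]

print(candidates(1))                              # record 2 shares a key
print([c for c in candidates(1) if is_match(1, c)])
```

Note how record 3 is never compared against records 1 and 2: it shares no keys, so it never enters the candidate set. That pruning is the point of tokenization.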


Populations
Siperian Hub uses the concept of populations to encapsulate intelligence about
customer name and address data for particular geographic groups. For example,
different countries use different formats for addresses. These differences include such
things as the placement of the street number and street name, location of the postal
code, and other variations in addresses. In addition, different populations have
different distributions of surnames. For example, US name data typically has Smith as
1% of the surnames. Other populations have other distributions. Siperian Hub uses
this intelligence to more effectively match name and address data.

Tokens for Match Keys


A token (also called a match key) is a fixed-length compressed and encoded value built
from a combination of the words and numbers in a name or address such that relevant
variations have the same key value. For one name or address, multiple match keys are
generated. The number of keys generated per base object record varies, depending on
your data and the match key level.

Siperian Hub fuzzy matching uses tokens as a basis for searching for potential matches.
Tokens allow Siperian Hub to match rows with a degree of fuzziness - the match need
not be exact to be considered a match. The process of generating tokens is called
tokenization. Before you can use fuzzy matching, you must generate these tokens.

For example, the following strings generate the following tokens:


Example of strings and tokens
String Token
BETH O'BRIEN MMU$?/$-
BETH O'BRIEN PCOG$$$$
BETH O'BRIEN VL/IEFLM
LIZ O'BRIEN PCOG$$$$
LIZ O'BRIEN SXOG$$$-
LIZ O'BRIEN VL/IEFLM


Note: The tokens that are generated depend on your data and the parameters you set
for match keys.

When searching for match candidates, LIZ O'BRIEN and BETH O'BRIEN are
considered as candidates because they have some key values in common.

With respect to tokens, Siperian Hub does several things during match.
• Siperian Hub checks for the tokenization incomplete indicator. If the last
tokenization process started but didn't finish, this indicator is set. If the
tokenization incomplete indicator is set, Siperian Hub re-tokenizes the data before
matching.
• Siperian Hub checks for the dirty indicator. The dirty indicator indicates that an
update occurred after the last time this data was tokenized. The dirty indicator can
propagate from a child to a parent record. A value of 0 in the dirty indicator
indicates that the record in the token table is up to date. If the record is not up to
date, Siperian Hub tokenizes the data before matching.

After generating the tokens, the next step in the match process is to get match
candidates from the database using the keys defined for the names or addresses. This is
done using the match keys generated on the column(s) selected to form the match key.

Determining When to Tokenize Your Data


You can tokenize your data at any of these times:
• when it is loaded
• when it is put into the table (using the PUT or CLEANSE_PUT API calls)
• right before you match

The default setting is to not tokenize when you load or put your data. Keep the
following in mind before changing this setting:

Do not use the Generate Match Tokens on Put option if you are using the API. If you
have this parameter set on, your PUT and CLEANSE_PUT API calls will fail. Use the
TOKENIZE verb instead. Only turn on Generate Match Tokens on Put if you’re not
using the API and you want data steward updates from the console to be tokenized
immediately.

To learn more, see “Modifying the Properties of Base Objects” in the Siperian Hub
Administrator’s Guide.

Match Key Widths


Siperian Hub supports the following key widths:
• Standard Keys
• Extended Keys
• Limited Keys
• Preferred Keys

These widths represent tradeoffs between the match precision and the space used by
the tokens. The space used is determined by the number of tokens generated.

For typical customer data, use the standard key width. The number can vary based on
the data, but generally the standard key width generates approximately five or six token
records per base object record.

Extended keys support more variation in the values for the key, but also generate more
records in the token table, about 10 to 12 token records for every base object record.

Limited keys support less variation in the values used for the key, but the token table is
also much smaller, with perhaps two to three token records per base object record. If
your data has character transpositions in the data used for the key, limited keys may not
be the best choice.

Preferred keys generate a single key per base record. This reduces the number of
comparisons and increases performance, but can result in returning fewer matches than
other key width options. Use this option if you have high volumes of high quality data.
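The approximate tokens-per-record figures above translate directly into token table size, which is the space side of the tradeoff. A rough sizing sketch (the per-record averages are the approximations quoted above; actual counts depend entirely on your data):

```python
# Rough token-table sizing from the approximate tokens-per-record
# figures quoted above. Actual counts vary with your data.
TOKENS_PER_RECORD = {
    "Standard": 5.5,   # ~5-6 token records per base object record
    "Extended": 11,    # ~10-12
    "Limited": 2.5,    # ~2-3
    "Preferred": 1,    # a single key per base object record
}

base_object_rows = 1_000_000
for width, per_record in TOKENS_PER_RECORD.items():
    print(f"{width:9s} ~{int(base_object_rows * per_record):,} token rows")
```

For a million-row base object, the gap between Limited and Extended keys is several million token rows, which is why key width matters for both storage and the number of candidates generated.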


Match Key Types and Mixed Data


The match key type you select has a big effect on the match results.

For Party objects that include organizations and people in the same object, the match
key type must be Organization_Name, and it must be based on the full name column
from the Party object. The full name field must be populated for all records and for
individuals it should at least include first name, middle name and last name.

Search Strategies
The search strategy determines how many candidates are returned in the search phase
of the match process. The number of candidates has a direct effect on the number of
matches returned and the amount of time it takes Siperian Hub to apply your match
rules.

The search strategy used to determine the set of candidates for matching must find the
balance between finding all possible candidates, and not slowing the process with too
many irrelevant candidates.

Applications dealing with relatively clean and complete data can use a high
performance strategy, while applications dealing with less clean data or with more
critical duplication issues must use more complex strategies.

To achieve this, four search strategies or search levels are supported:


• Narrow
• Typical
• Exhaustive
• Extreme

Narrow gives the best performance but supports the least complexity, as it generates
the fewest candidates. Extreme supports the highest level of complexity but gives the
worst performance as it generates the most candidates.


For typical customer data, a search strategy of Typical is usually appropriate. You may
want to change this to Narrow for very large data volumes or highly matchy data.
Alternatively, if you have a small data set, or if it is critical that the highest possible
number of matching records be identified, use the Exhaustive or Extreme search levels
instead.

If both performance and completeness of match are critical, then consider a 2-phase
approach to the match process: in the first phase, use a narrow search level to more
quickly match and then merge highly similar records. Then switch to a different rule set
that uses extreme or exhaustive search levels to provide the more complex and
complete searches for candidates.

Match Purposes
The match purpose describes the overall goal of a match rule. The match purpose is
very important because it determines which columns are used for matching. The list of
match purposes available is determined by the population you select. For a list and
descriptions of the standard purposes, see the Siperian Hub Administrator’s Guide.

Each match purpose supports a combination of mandatory and optional fields and
each field is weighted according to its influence in the match decision. Some fields in
some purposes may be grouped. There are two types of groupings:
• Required—requires at least one of the field members to be non-null
• Best of—contributes only the best score from the fields in the group to the overall
match score

For example, in the Individual match purpose:


• Person_Name is a mandatory field
• One of either ID Number or Date of Birth is required
• Other attributes are optional

The overall score returned by each purpose is calculated by adding the participating
field scores multiplied by their respective weight and divided by the total of all field
weights. If a field is optional and is not provided, it is not included in the weight
calculation.
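The weighted-average calculation described above can be sketched as follows. The field names and weights here are purely illustrative, not Siperian Hub's actual purpose definitions.

```python
# Sketch of the overall purpose score described above: the sum of
# participating field scores times their weights, divided by the total
# of the participating weights. Optional fields that are not provided
# are left out of both sums. Weights here are illustrative only.

def purpose_score(field_scores, weights):
    """field_scores maps field -> score (0..100), or None if not provided."""
    participating = {f: s for f, s in field_scores.items() if s is not None}
    total_weight = sum(weights[f] for f in participating)
    weighted_sum = sum(s * weights[f] for f, s in participating.items())
    return weighted_sum / total_weight

weights = {"Person_Name": 50, "ID": 30, "Address_Part1": 20}
scores = {"Person_Name": 90, "ID": 100, "Address_Part1": None}  # optional field missing

print(purpose_score(scores, weights))  # (90*50 + 100*30) / (50 + 30) = 93.75
```

Because the missing optional field drops out of the denominator as well as the numerator, an absent field neither helps nor hurts the score.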


Using the Match Purposes to Match People


When matching people, if the match includes the address fields, then the Resident
purpose is better than the Individual purpose. However, if you want to match on
person and external IDs, don't use Resident because it requires the address
information. In that case, use Individual.

Using the Match Purposes to Match Organizations


When matching organizations, the Division purpose is better than the Organization
purpose. Organization allows organizations without addresses to match with
organizations with addresses, which may not be what you want. Division only matches
records with similar addresses.

Using the Match Purposes to Match Addresses


For match purposes that use address, don’t use Address_Part_2 on its own without
Address_Part_1. If you must match on zip or city, then add an exact match column on
zip or city. Using just Address_Part_2 gives you a very loose match. Alternatively, you
could add a column using Postal_Area instead of an exact match on city/zip.

Name Formats
Siperian Hub match has the concept of a default name format which tells it where to
expect the last name. The options are:
• Left—last name is at the start of the full name, for example Smith Jim
• Right—last name is at the end of the full name, for example, Jim Smith

The name format used by Siperian Hub depends on the purpose that you're using. If
using Organization, then the default is Last name, First name, Middle name. If using
Person/Resident then the default is First Middle Last.

Bear this in mind when formatting data for matching. It might not make a big
difference, but there are edge cases where it helps, particularly for names that do not
fall within the selected population.


Field Types Used in Purposes


Below are descriptions of the fields supported by the various Match Purposes,
provided in alphabetical order.
Field Types Used in Match Purposes
Field Name Description
Address_Part1 Address_Part1 includes that part of address up to, but not
including, the locality last line. The position of the address
components should be the normal word order used in your data
population. Pass this data in one field. Depending on your base
object, you may concatenate these attributes into one field
before matching. For example, in the US, an Address_Part1
string includes the following fields: Care-of + Building Name +
Street Number + Street Name + Street Type + Apartment
Details. Address_Part1 uses methods and options designed
specifically for addresses.
Address_Part2 This is the locality line in an address. For example, in the US, a
typical Address_Part2 includes: City + State + Zip (+ Country).
Matching on Address_Part2 uses methods and options designed
specifically for addresses.
Attribute1, Attribute2 Attribute 1 and Attribute 2 are two general purpose fields. They
are matched using a general purpose string matching algorithm
that compensates for transpositions and missing characters or
digits.
Date The Date field matches any type of date, such as: date of birth,
expiry date, date of contract, date of change, creation date, etc. It
expects the date to be passed in Day+Month+Year order. It
supports the use or absence of delimiters between the date
components. Matching on dates uses methods and options
designed specifically for dates. It overcomes the typical error and
variation found in this data type.
ID The ID field matches any type of ID number, such as: Account
number, Customer number, Credit Card number, Drivers
License number, Passport, Policy number, SSN or other identity
code, VIN, etc. It uses a string matching algorithm that
compensates for transpositions and missing characters or digits.

Organization_Name The Organization_Name field matches the names of
organizations. These could be company names, business names,
institution names, department names, agency names, trading
names, etc. This field supports matching on a single name, or a
compound name (such as a legal name and its trading style). You
may also use multiple names (e.g. a legal name and a trading
style) in a single Organization_Name column for the match.
Person_Name The Person_Name field matches the names of people. Use the
full person name. The position of the first name, middle names,
and family names, should be the normal word order used in your
data population. For example, in English speaking countries, the
normal order is: First Name + Middle Name(s) + Family
Name(s). Depending on your base object design, you may
concatenate these fields into one field before matching. This
field supports matching on a single name, or an account name
(such as JOHN & MARY SMITH). You may also use multiple
names, such as a married name and a former name).
Postal_Area The Postal_Area field can be used to place more emphasis on
the postal code than if it were included in the Address_Part2
field. It is for all types of postal codes, including Zip codes. It
uses a string matching algorithm that compensates for
transpositions and missing characters or digits.
Telephone_Number The Telephone_Number field is used to match telephone
numbers. It uses a string matching algorithm that compensates
for transpositions and missing digits or area codes.

Match Levels
In conjunction with the match purposes, you can choose one of three different
match levels. To learn more about match levels, see “Match and Merge Setup” in the
Siperian Hub Administrator’s Guide.

Defining and Testing Your Match Rules


When defining your match rules, keep the following points in mind:
• Identify records with large numbers of similar values in the match key field. This is
called matchy data. Determine whether those records should be considered for
matching. If they shouldn't be considered for matching, then flag those records as
consolidated before running the match. To learn more, see “About Consolidation
Codes” in the Siperian Hub Administrator’s Guide. If they should be considered for
matching, then determine whether you can use the Match for Duplicate Data
functionality to quickly match and merge the records. To learn more, see “Match
for Duplicate Data” in “Match and Merge Setup” in the Siperian Hub Administrator’s
Guide.
Examples of such records are the health-care customers named 'GROUP
PRACTICES', 'MULTIPLE DOCTORS' and the zip-aligned customers where the
names are all the same with the exception of the zip numbers at the end.
Such records all generate the same match keys, resulting in an enormous pool of
records for the match to compare against each other, significantly skewing the
match data set and negatively affecting performance.
• For a Party base object - i.e. one that contains both organizations and individuals:
create different rules for organizations and individuals based on a customer type /
customer class indicator. For each rule, use a match purpose that is appropriate for
the customer type.
• If you can do an exact match on an attribute, then include that attribute as an exact
match column. It makes a significant difference to performance, as it acts as a filter
on the match. If all your columns are fuzzy match columns, then you're not going
to get great performance.
• If suffix (Jr., Sr., II, III, etc.) is important in the match, then define it as an exact
match column and switch on null matching on that column. If it is only a part of
the full name used in Siperian Hub matching then you end up matching records
that do not have a suffix with records that do have a suffix.
• If you do not have much variation in the values in the column you’re keying on
(that is, a low-cardinality column), or if there is little in the way of misspellings and
character transpositions in the data, then use an exact match base object instead of
a fuzzy match base object. The performance is significantly better. An example of
an ideal candidate for an exact match base object is an External Identifier base
object that stores identifiers such as social security numbers, license numbers, etc.

About Testing
For prototyping and testing your rules, use random data that is of both reasonable
quality and quantity. Do not build the prototype to search on made-up names in a
development database. Fabricated data will not give you an accurate picture of how
your rules will behave in a production environment. Use a random sample of real data.

Understand the business and performance needs of the match. There is a natural
conflict between performance and completeness of search. To balance these conflicting
requirements, choose the search level with care. Test your searches using different
search levels on real production data.

When making your judgement, consider measures of completeness (the percentage of
known matches found) against measures of performance (how long the search
transaction or batch job took). Choose the one that best conforms to your business
requirements.

When measuring match completeness, it is best to have a known set of expected search
results. When measuring performance, in addition to ensuring the actual production
volume of data is being searched, also take into account network and machine load
overhead.

Matching Best Practices


Keep these considerations in mind as you define and tune your rules:
• The more fields and columns you can give Siperian Hub the better, as that helps it
get better, and sometimes more, matches. These additional fields provide
additional context for the match. This context allows Siperian Hub to make
decisions about which columns in a match have higher or lower levels of
importance in determining the outcome of the match.
• If you add exact match columns, these columns have a filtering effect. Exact match
columns are applied before fuzzy columns are considered. They result in the set of
match candidates being reduced to only those records that have the same value in
the exact match column as at least one other record in the exact match set.
• If you add fuzzy match columns, these columns do not have a filtering effect on
the match. They do not filter out matches on their own; the match engine evaluates
all fuzzy columns before determining whether two records are a match.
• Avoid breaking composite values down into their constituent parts, as doing so
removes much of the context information that Siperian Hub can derive from the
way in which the elements in the composite value are defined. For example, pass a
person's full name with as much detail in that field as possible - first name, middle
name, last name, suffix, etc. - instead of parsing that field out into first name,
middle initial, last name etc.
• Do not filter values out of the data for matching. For example, if you parse suffix
(Jr., Sr., etc.) from the full name to include it in an exact match column, don't
remove it from full name field.

Exact Match Column Properties


For exact match columns, you can specify properties that alter the standard exact
match rule behavior.

Siperian Hub supports the following properties for exact matches:


• Null Match
• Segment Match

For information on how to use the Hub Console to set these match types, see “Match
and Merge Setup” in the Siperian Hub Administrator’s Guide.

Null Match
The standard behavior of Siperian Hub matching is to treat each NULL value as a
placeholder for an unknown value. So, by default, Siperian Hub treats nulls as unequal
in a match. You can alter this behavior by enabling the Match NULLs property. When
you enable NULL matching, you have these options:

Property                Description
Disabled                Regardless of the other value, nothing will match (nulls are
                        unequal values). This is the default setting.
NULL Matches NULL       If the other value is also NULL, it is considered a match. A
                        NULL value is treated as a particular value in its own right:
                        it matches another NULL value, but does not match any other
                        value.
NULL Matches Non-NULL   If the other value is not NULL, it is considered a match. A
                        NULL value is treated as missing data, so it matches
                        non-NULL data.
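The three options can be sketched as a comparison function. This is an illustrative sketch of the behavior described in the table above, not Siperian Hub's implementation.

```python
# Sketch of the NULL-matching options described above for an exact
# match column. Illustrative only; not the Siperian Hub engine.

def exact_match(a, b, null_option="Disabled"):
    if a is None or b is None:
        if null_option == "NULL Matches NULL":
            # NULL is a value in its own right: it only matches another NULL.
            return a is None and b is None
        if null_option == "NULL Matches Non-NULL":
            # NULL is missing data, so it matches whatever the other side is.
            return True
        return False  # Disabled (default): nulls never match
    return a == b

print(exact_match(None, None))                            # Disabled: no match
print(exact_match(None, None, "NULL Matches NULL"))       # match
print(exact_match(None, "Jr.", "NULL Matches NULL"))      # no match
print(exact_match(None, "Jr.", "NULL Matches Non-NULL"))  # match
```

This makes the suffix example above concrete: with NULL Matches Non-NULL on a suffix column, a record with no suffix can still match a record carrying “Jr.”.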

Use null match in cases where you don’t lose any data if you have a null value. For
example, null match makes sense for middle name or suffix columns. Conversely, do
not use null match in cases where null match could produce an incorrect result. For
example, null match is generally inappropriate for a first name column.

When you enable null match for a rule, the rule is rarely applied as most data has
relatively few nulls. Generally speaking, more match rules means more overhead, and
higher overhead can have an effect on performance. Typically, if you have ten match
columns, only enable null match on one or two of those columns. The best way to
determine if null matching is appropriate is to know your data and test your rules.

Note: You cannot have null match enabled on the same column you are using for a
segment match. See “Segment Match” on page 88 for more.

Segment Match
Segment matching is useful for cases where you have different classes of information in
your base object. In this case, you may need different match rules to apply to different
types of data. For example, you have a base object that contains customer information
for your medical products. This base object contains information for individual
doctors, group practices, HMOs, and hospitals. You can create a column that indicates
the type of information the record contains: individual, group, and so on. Each of these
subsets of your records is referred to as a segment.

Note: Segment matching doesn’t support recursive relationships. An example of a
recursive relationship is: a group practice is part of a clinic, which in turn is part of a
hospital.

This example is better suited to the Segment Matches All Data situation. A more
common example, and a more illustrative one, is where you have organizations and
individuals in the same Customer table, and you only want to match organizations with
organizations and individuals with individuals.

A common scenario where segment matching is useful is a customer base object with
individual records as well as organization records. You want to match organizations to
organizations, and individuals to individuals. You never want to match individuals to
organizations. For example, you have these rows:
Example of segment matching
Customer Name Customer Class
ABC, Inc O
ABC Company O
Annette Curtin I
A Curtin I

You can create specific match rules for each segment, resulting in different rules for
different types of data. You can also specify the name of a segment.

Using Segment Matches All Data

Generally, segments are used to match within subsets of data. For example, suppose
you have a column called MATCH_COLUMN_SEGMENT whose values are “A”, “B”, and “C”.
To match within the “B” segment, create a rule that only generates matches when MATCH_
COLUMN_SEGMENT = “B”. Siperian Hub only generates matches against other rows
whose segment is also “B”. If you turn Segment Matches All Data on, it matches all the
rows in the “B” segment against any other segment. To use the sales leads/customer
database example, if you choose Segment Matches All Data, Siperian Hub matches
sales leads against everything. If this checkbox is not selected, Siperian Hub only
matches sales leads against sales leads.
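The two behaviors can be sketched as candidate-pair generation. This is an illustrative sketch of the scoping described above, not the actual match engine.

```python
# Sketch of segment matching as described above: a rule scoped to
# segment "B" compares rows only within that segment; with Segment
# Matches All Data, "B" rows are compared against every row.
# Illustrative only.

rows = [
    {"id": 1, "MATCH_COLUMN_SEGMENT": "A"},
    {"id": 2, "MATCH_COLUMN_SEGMENT": "B"},
    {"id": 3, "MATCH_COLUMN_SEGMENT": "B"},
    {"id": 4, "MATCH_COLUMN_SEGMENT": "C"},
]

def candidate_pairs(rows, segment, matches_all_data=False):
    left = [r for r in rows if r["MATCH_COLUMN_SEGMENT"] == segment]
    right = rows if matches_all_data else left
    return [(a["id"], b["id"]) for a in left for b in right if a["id"] < b["id"]]

print(candidate_pairs(rows, "B"))        # [(2, 3)]: within the segment only
print(candidate_pairs(rows, "B", True))  # [(2, 3), (2, 4), (3, 4)]: against all data
```

The rule's segment restricts which rows initiate comparisons; the Segment Matches All Data checkbox widens only the set they are compared against.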


A common scenario where segment matching is useful is a base object with customer
records as well as sales lead records. You want to match leads to customers, but never
the other way around. For example, you have these rows:
Example of segment matching
Customer Name Sales Lead Flag
ABC, Inc C
AB Inc C
ABC, Inc L

If the segment matches all data option is selected, then Siperian Hub starts with the
records in the specified segment and attempts to find matches for those records in the
entire base object. For example, your base object contains both sales leads and
customers, indicated by either a C (for customer) or an L (for lead) in the segment
column. To match leads against both other leads and customers, choose segment
matching and Segment Matches All Data. If you matched this data without the
Segment Matches All Data option, you would match leads against customers, but you
would also match customers against leads, which might result in less reliable data.

Keep in mind that the segment match, and therefore the segments matches all data
option, applies to only one match rule. You can have a number of different match
rules, some which use the segment and others which don't. Using the lead and
customers example, the segment match with segment matches all data allows you to
define a looser match rule for the leads segment that allows leads to be loosely matched
to other leads as well as customers. You do not want customers to match to customers
on that same loose match rule as it could result in overmatching your customer
records. So the segment limits the loose match rule just to leads, but would not restrict
leads from only matching to other leads.


Using Matching on Dependent Tables


If you have parent and child objects and you wish to match the child objects, you must
include the parent object’s ROWID in all match rules for the child object. If you do not
do so, you will lose data. For example:

You have a parent table, COMPANY, and a child table, ADDRESS. Matching
addresses within a company without including ROWID_COMPANY in all match rules
causes you to lose a company’s address with each merge. For example, the child table
includes these rows:
Example of matching with child tables
COMPANY_ROWID Address
12345 100 Main St
54321 100 Main St

If you do not include the COMPANY_ROWID in all the match rules, these two rows
are merged and there is a single COMPANY_ROWID. If the remaining ROWID is
12345, the company with rowid 54321 no longer has a record in the address child table.
This data is lost.
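The data-loss scenario above can be sketched with a toy match rule. This is illustrative only; the column names follow the example table, and the rule logic is a stand-in for a real match rule definition.

```python
# Sketch of why child-object match rules must include the parent
# ROWID, per the COMPANY/ADDRESS example above. Illustrative only.

addresses = [
    {"COMPANY_ROWID": "12345", "Address": "100 Main St"},
    {"COMPANY_ROWID": "54321", "Address": "100 Main St"},
]

def rule_matches(a, b, include_parent_rowid):
    same_address = a["Address"] == b["Address"]
    if include_parent_rowid:
        return same_address and a["COMPANY_ROWID"] == b["COMPANY_ROWID"]
    return same_address

# Without the parent ROWID, the two rows match and would merge,
# leaving one company with no record in the address child table:
print(rule_matches(addresses[0], addresses[1], include_parent_rowid=False))  # True

# With the parent ROWID in the rule, the rows are kept apart:
print(rule_matches(addresses[0], addresses[1], include_parent_rowid=True))   # False
```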

Setting Match Batch Sizes


The match batch size is the number of records Siperian Hub attempts to match in one
group. If the total number of records considered for match and merge exceeds this
maximum match batch size, the match process performs the match in cycles. Each
cycle is limited to matching the number of records specified by this parameter.

It may seem like a good idea to use a very large match batch size. But the correct size
of the match batch depends on the cardinality of your data and the number of matches
your rules return.
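The cycling behavior is simple arithmetic: if the candidate set exceeds the maximum match batch size, matching runs in cycles of at most that many records. The numbers below are illustrative, not recommended settings.

```python
# The match-cycle arithmetic described above: records beyond the
# maximum match batch size are processed in additional cycles.
# Example numbers are illustrative, not recommended settings.
import math

total_records = 250_000
match_batch_size = 40_000

cycles = math.ceil(total_records / match_batch_size)
print(cycles)  # 7 cycles of up to 40,000 records each
```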


Using Dynamic Match Analysis Threshold


Dynamic Match Analysis Threshold is a setting in the Match/Merge setup screen.
Dynamic match analysis analyzes the match process at runtime to determine if the
match process will take an unacceptably long period of time. The threshold value is
how you specify the maximum acceptable number of comparisons.

The analysis is computed by multiplying the number of records in the base match
group by the number of records in the token table that must be compared. If this
product is less than the threshold, the match proceeds. If it is greater than the
threshold, the match is not done, and a message noting the range for further
investigation is written to the log.

Tuning Match for Performance


One of the primary culprits in poor match performance is an excessive number of
comparisons. The match process creates a list of match candidates, and it is these
candidates that are then compared to determine matches. Match candidates are
determined by the values in the match columns. For example, suppose you are
matching a dataset of pharmacies that contains 50,000 instances of BigChain
Pharmacy. Each of these 50,000 records may be unique, but unless you reduce the
set of candidates, there are 50,000 candidates, each of which must be compared to
determine matches. It is this comparison work that directly affects performance.
Controlling the number of match candidates is the key to improving match
performance.
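To see why candidate counts matter, consider the comparison count for a naive pairwise match (an illustrative Python sketch, not Siperian Hub code):

```python
def pairwise_comparisons(candidates):
    """Comparing every candidate against every other candidate
    requires n * (n - 1) / 2 comparisons."""
    return candidates * (candidates - 1) // 2

# 50,000 BigChain Pharmacy candidates -> about 1.25 billion comparisons
print(pairwise_comparisons(50_000))  # 1249975000
```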

The performance of your system is a function of many individual things. There are
some basic strategies you can use to optimize the performance of your matching:
• The single most effective thing you can do to improve performance is to know
your data. This knowledge enables you to apply the various strategies for
performance optimization and get the best results from your Siperian Hub
implementation.
• All match approaches are tradeoffs between performance and number of matches.
Be biased towards undermatching. Undermatching means that some possible
matches are missed. The reverse, overmatching, means that an excessive number of
comparisons are done, which can consume a great deal of processing time and
resources, depending on the size of the data set.

• Exact matches are much more efficient than fuzzy matches. Where possible, run
exact matches to reduce the number of candidate rows before running fuzzy
matches. If you have very matchy data, build a match rule set that has only exact
rules. Run this rule set to get rid of a large number of matches.
• If you have high volumes of high quality data, using the Preferred key width can
improve performance. However, if the quality of the data is lower, this will result in
a possibly unacceptable level of undermatching.
• If your ROWID objects are monotonically increasing, then using the Match Only
Previous ROWID option in the Match/Merge Setup screen can improve
performance. When this option is set, match comparisons are done only
downwards with respect to the ROWIDs. That is, row A is matched to row B, but
row B is not matched to row A. Setting this option can reduce the number of
comparisons by about half.
This option is inappropriate in the following cases:
• Records are inserted out of ROWID order
• You are using the services integration framework with this base object
• You are using user-declared ROWIDs
• If your data is appropriate for Match Only Previous ROWID Objects, use that and
also select Match Only Once. This option means that once record A has been
matched with another record, record A is not compared with any other record
again. This dramatically reduces the number of comparisons.
• As you are testing and tuning your match rules, use the Dynamic Match
Analysis Threshold option. To learn more, see “Using Dynamic Match Analysis
Threshold” on page 92.
• Avoid loose manual rules. This only moves the problem to data stewards.
• If you have a high volume of data:
• Do not have any unconditional fuzzy match rules.
• Always have some exact match filters on every rule. These filters reduce the
number of candidates for comparison.
• Create a number of almost identical fuzzy rules with different exact match
filters to reduce the number of rows that are compared for the fuzzy match.
For example, create rules such as the following:

• full name and address + exact postcode
• full name and address + exact state
• full name and address + exact first two digits of postcode
• Consider these exact matches when you are defining your cleansing process.
Optimizing the results of the cleanse process to generate good data for the
exact matches can significantly improve both performance and the quality of
the results. It’s much easier to make these optimizations when you’re defining
your cleanse processes than to go back after the data is in the base object and
you’ve found match issues.
• Never create automerge rules that contain only a single match column.
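The effect of the Match Only Previous ROWID option described above can be modeled in a short sketch (illustrative Python, not Siperian Hub code):

```python
def comparisons(n, previous_rowid_only):
    """Directed comparisons among n rows. With Match Only Previous
    ROWID enabled, row A is compared to row B but row B is not
    compared back to row A, roughly halving the comparison count."""
    return n * (n - 1) // 2 if previous_rowid_only else n * (n - 1)

print(comparisons(1000, False))  # 999000
print(comparisons(1000, True))   # 499500
```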

About Merging
There are two types of merges: automerge, which processes all records queued for
automerge, and manual merge, which requires a data steward to use the Merge
Manager. These two types of merges are functionally the same.

For all merges, there are two records, the source and the target. When you're merging
A into B, A is the source and B is the target. The only field that is guaranteed to survive
the merge is the ROWID of the source.

When the records are merged, all that matters is which is the source, and which is the
target. For the purposes of merge, trust on columns doesn’t apply. For non-trusted
columns, the source data always survives (and the target data is subsumed). The only
time the source data doesn't survive is when the validation rule is 100% downgrade,
0% minimum reserve trust. In this case, the target field prevails.
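The survivorship behavior described above can be summarized in a short sketch (illustrative Python, not Siperian Hub code; trust-enabled columns follow trust rules instead):

```python
def surviving_value(source_value, target_value, full_downgrade=False):
    """Survivorship for a non-trusted column when merging a source
    record into a target record: the source data survives, unless the
    validation rule is 100% downgrade with 0% minimum reserve trust,
    in which case the target field prevails."""
    return target_value if full_downgrade else source_value

print(surviving_value("123 Elm St", "500 Oak Ave"))        # 123 Elm St
print(surviving_value("123 Elm St", "500 Oak Ave", True))  # 500 Oak Ave
```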

Chapter 6: Implementing Hierarchy Manager

This chapter describes information that implementers need to know before beginning a
Hierarchy Manager™ (HM) implementation project. It is recommended for all
implementers.

Every implementation is unique. Therefore, neither this chapter nor any other can give
you exact, detailed instructions for your particular situation. This chapter:
• defines the concepts required for Hierarchy Manager
• outlines the methodology for implementing Hierarchy Manager
• describes design patterns in terms of various common requirements
• explains how to configure your Hierarchy Manager implementation

Chapter Contents
• About Hierarchy Manager
• Before You Begin Implementing Hierarchy Manager
• About Implementing a Hierarchy Manager System
• Step 1: Analyze Your Data
• Step 2: Build the Data Model
• Step 3: Configure Your Hierarchy Manager Implementation
• Step 4: Load Data

About Hierarchy Manager


Typically, customer relationship data is stored in a variety of different applications and
data warehouses, depending on the business need, making it difficult to view and
manage customer relationship data. Each application has a well-defined
hierarchy—such as customer-to-account, sales-to-account or product-to-sales—suited
for operational purposes and often managed well by the application. Meanwhile, each
data warehouse and data mart is designed to reflect relationships necessary for specific
reporting purposes, such as sales by region by product over a specific period of time.

Different groups within the organization view a given customer in different and
incomplete ways because the application they use has a limited view of the customer
relationship and hierarchy information that is specific to that application. In addition,
each of these applications may have conflicting information and semantics.

Hierarchy Manager delivers reliable and consolidated customer relationship views that
enable organizations to navigate, analyze and manage relationships across multiple
hierarchies from disparate applications and data sources.

Hierarchy Manager is part of Siperian Hub, and builds on the power of Siperian Master
Reference Manager (MRM), leveraging MRM’s ability to provide the best version of
truth from disparate data sources and applications. Hierarchy Manager allows you to
gather, visualize, and manage relationships and hierarchies within your data set. With
this powerful tool, you can visualize the relationships in your data and use this
relationship information to more effectively cross-sell and up-sell into your existing
customer base. In addition, these relationships allow you to:
• more strategically manage accounts
• audit prospects
• align territories more accurately
• manage compensation more precisely
• get a complete relationship view of a customer (for example, a multi-generational
family)

Note: Hierarchy Manager is a part of Siperian Hub and is intended to be used with
MRM. Hierarchy Manager requires certain MRM capabilities, such as match and merge.

Before You Begin Implementing Hierarchy Manager


Hierarchy Manager is part of the Siperian platform. Before you implement HM, you
must install and configure MRM. See the Siperian Hub Installation Guide for your
platform to learn more about installing the product. See the Siperian Hub Implementer’s
Guide and the Siperian Hub Administrator’s Guide to learn more about implementing and
configuring Siperian Hub.

Before you implement your HM system, you must be familiar with MRM and
proficient in using the MRM tools. See the Siperian Hub Administrator’s Guide to learn
more about using MRM. You are also assumed to be familiar with Hierarchy Manager
and Hierarchy Manager concepts. To learn more, see “Using the Hierarchy
Manager” in the Siperian Hub User’s Guide.

Also, in order to use Hierarchy Manager, you must have a license file from Siperian that
indicates you have purchased a license for Hierarchy Manager. To learn more about this
license, contact Siperian support.

Defining Your Goals


Before you begin designing your implementation, it is essential that you define the
goals of the HM implementation.

You must determine:


• what systems you will be using as data sources
• the data relationships you wish to explore and manage
• the hierarchies you expect to define

Understanding the Data


Before starting your HM implementation, you must have a thorough understanding of
the data you are integrating. For example, you must know the source of the data, the
data’s relative accuracy, structure, size, trends in the data, the amount of data, the
expected growth of the dataset, the relationships between the data, and any
characteristics that are peculiar to the data from each data source.


Assembling the Team


You must also determine the people who will fill the roles required for your HM
implementation. Typically, these roles are:
• Implementation Specialists—people whose expertise is implementing
applications.
• Data Stewards—custodians of data quality. In HM terms, data stewards are the
people responsible for maintaining relationship data on a regular and ongoing
basis.
• MRM Administrators—IT people responsible for configuring or updating a Hub
Store so that it provides the rules and functionality required by the data stewards.
• Application Developers—developers who integrate the Hub Store into other
applications, such as web or CRM applications.
• DBAs—people who will maintain the database that is the basis of HM and
MRM. Since HM is database-based, DBAs contribute significantly to the design
phase of the project.

Determining Resources
Lastly, you must determine the implementation resources that will be available to you.
These resources may include:
• Message Queues
• Hardware and Network Resources

About Implementing a Hierarchy Manager System


Implementing a Hierarchy Manager system is an iterative process. Since it is
impossible to have all the necessary information in hand at the beginning, things
you learn in the process will cause you to go back and modify your implementation.
Implementing is a matter of designing, building, testing, modifying, and testing some
more.

While every implementation is different, there is a series of steps you will perform in
every implementation:

• Step 1: Analyze Your Data
• Step 2: Build the Data Model
• Step 3: Configure Your Hierarchy Manager Implementation
• Step 4: Load Data
• Step 5: Test and Tune the System. You must test the system to make sure it
behaves according to your needs as you have configured it.

Note: If you are implementing an MRM system and intend to also implement a
Hierarchy Manager system, it is a good idea to implement the MRM system with
the Hierarchy Manager system in mind.

It is also possible to use the Siperian Services Integration Framework (SIF) to write
applications that use Hierarchy Manager functionality. To learn more about SIF, see the
Siperian Services Integration Framework Guide.

Step 1: Analyze Your Data


As with your MRM implementation, the first step in implementing a Hierarchy
Manager system is to look closely at the data. The success of your implementation
depends on how well you understand your data.

Each of these steps requires that you examine the business requirements for each part
of the Hierarchy Manager system. It is especially important to dig deeply into these
requirements. Asking stakeholders why they want to do things the way they do is
particularly effective as often there are different ways to achieve the same functionality.
The added information about the stakeholder’s reasons will help you choose the
correct solution for your organization.

Analyzing the data includes the following steps:


1. Defining the Data Flow and Source Systems

2. Determining Entities and Entity Types


3. Determining Relationships and Relationship Types
4. Determining Hierarchies and Hierarchy Types

5. Creating a Sample Data Set for testing purposes

Defining the Data Flow and Source Systems

Determine the source systems that will feed data into Hierarchy Manager and your
MRM systems. You must know exactly what data is coming from where.

Consider the following characteristics of each data set from each source:
• type
• quality
• quantity
• source
• relationships between data from the same source
• relationships to data from other sources
• any other characteristics that are peculiar to each data set

Determining the data flow provides the basis for correctly sizing your Hierarchy
Manager implementation. When determining the correct size for your system,
consider:
• the quantity of data
• the frequency of updates for that data within each source system
• how often this data will be brought into Hierarchy Manager to update the master
records

Determining Entities and Entity Types

Your research into the business needs driving your Hierarchy Manager implementation
and knowledge of the data set will result in a relatively small number of general types of
things you wish to relate.

When determining the different entity types, consider the relationships you expect
these entities to have and what the hierarchies might look like. To learn more about
entities and entity types, see “Using the Hierarchy Manager” in the Siperian Hub
User’s Guide.

Determining Relationships and Relationship Types

You must have a clear idea of the relationships you wish to manage and explore with
Hierarchy Manager. Knowing this sort of relationship data by no means precludes
discovering additional, heretofore unknown relationships via the Hierarchy Manager
tool. To learn more about relationships and relationship types, see “Using the
Hierarchy Manager” in the Siperian Hub User’s Guide.

Determining Hierarchies and Hierarchy Types

Once you have an idea of the entities and relationships you will need, it is important to
think about the hierarchies that will be built on that foundation. To learn more about
hierarchies and hierarchy types, see “Using the Hierarchy Manager” in the Siperian Hub
User’s Guide.

Creating a Sample Data Set

Implementation of your Hierarchy Manager system requires iteration to tune and
optimize your system. This is most easily done with a small, representative sample of
your data. A sample of a few thousand records or so, containing examples of each type
of entity you expect the system to support is a good starting sample. In addition, this
sample data must contain the various relationships you expect your Hierarchy Manager
system to include. You need sample data from each of your source systems. Naturally,
the more closely the sample data reflects the characteristics of the complete data set,
the more useful it will be. For example, if you have a customer database with most
customers in the United States, you might use data from just a few states.

Most of your testing will be done with this small test database. However, as your
system moves closer to deployment, test the system with larger databases. To use the
example of a customer database with mostly US addresses, you might test with data
from three or four states initially. You might then move on to testing with data from
ten states, and then twenty or twenty five.


Step 2: Build the Data Model


Once you are very familiar with the source systems, the data, the data flow, the entities,
the relationships and the hierarchies, you are ready to start designing and building the
Hierarchy Manager data model. As with the entire implementation process, building
the data model is iterative. You will learn things in the process of defining the model
and testing that will cause you to go back and modify your data model.

Note: Hierarchy Manager uses a data model that expands the one used by MRM. It is
assumed that you are familiar with MRM, its tools, and its data model. To learn more
about the MRM data model, see the Siperian Hub Administrator’s Guide and the Siperian
Hub Implementer’s Guide. To learn about building the data model, see the Siperian Hub
Administrator’s Guide.

Step 3: Configure Your Hierarchy Manager Implementation


One of the primary features of Hierarchy Manager is its ability to allow you to visualize
your Hierarchy Manager data. You can specify how that data is displayed. To configure
your implementation, including how the data is visualized, see the Siperian Hub
Administrator’s Guide.

Step 4: Load Data


The next step is to load data so you can test your system. Based on these tests, you can
revise and tune your Hierarchy Manager implementation for your needs. To learn about
loading data, see the Siperian Hub Administrator’s Guide.

Chapter 7: Scheduling Batch Jobs and Batch Groups

This chapter explains how to schedule batch jobs and batch groups in a Siperian Hub
implementation. The information in this chapter is intended for implementers and
system administrators.

Important: You must have the application server running for the duration of a batch
job.

Chapter Contents
• About Scheduling Siperian Hub Batch Jobs
• Setting Up Job Execution Scripts
• Monitoring Job Results and Statistics
• Job Scheduling Reference
• Scheduling Batch Groups
• Developing Custom Stored Procedures for Batch Jobs

About Scheduling Siperian Hub Batch Jobs


A Siperian Hub batch job is a program that, when executed, completes a discrete unit of
work (a process). All public batch jobs in Siperian Hub can be executed as database
stored procedures. To learn more about batch jobs, see the Siperian Hub Administrator’s
Guide.

In the Hub Console, the Siperian Hub Batch Viewer and Batch Group tools provide
simple mechanisms for executing Siperian Hub batch jobs. However, they do not
provide a means for executing and managing jobs on a scheduled basis. For this, you
need to execute stored procedures that do the work of batch jobs. Most organizations
have job management tools that are used to control IT processes. Any such tool
capable of executing Oracle PL/SQL or DB2 SQL commands can be used to schedule
and manage Siperian Hub batch jobs.

Setting Up Job Execution Scripts


This section describes how to set up job execution scripts for running Siperian Hub
stored procedures.

Metadata in the C_REPOS_TABLE_OBJECT_V View


Siperian Hub populates the C_REPOS_TABLE_OBJECT_V view with metadata
about its stored procedures. You use this metadata to:
• determine whether a stored procedure can be run using job scheduling tools, as
described in “Determining Available Execution Scripts” on page 107
• retrieve identifiers in the job execution scripts that execute Siperian Hub stored
procedures, as described in “Retrieving Values from C_REPOS_TABLE_OBJECT_V
at Execution Time” on page 107


C_REPOS_TABLE_OBJECT_V has the following columns:


C_REPOS_TABLE_OBJECT_V Columns

ROWID_TABLE_OBJECT: Uniquely identifies a batch job.

ROWID_TABLE: Depending on the type of batch job, this is the table identifier for
either the table affected by the job (target table) or the table providing the data
for the job (source table).
• For Stage jobs, ROWID_TABLE refers to the target table (the staging table).
• For Load jobs, ROWID_TABLE refers to the source table (the staging table).
• For Match, Match Analyze, Autolink, Automerge, Auto Match and Merge,
External Match, Generate Match Token, Match for Duplicate Data, and Key
Match jobs, ROWID_TABLE refers to the base object table, which is both source
and target for the jobs.

OBJECT_NAME: Description of the type of batch job. Examples include:
• Stage jobs: CMX_CLEANSE.EXE
• Load jobs: CMXLD.LOAD_MASTER
• Match and Match Analyze jobs: CMXMA.MATCH

OBJECT_DESC: Description of the batch job, including the type of batch job as
well as the object affected by the batch job. Examples include:
• Stage for C_STG_CUSTOMER_CREDIT
• Load from C_STG_CUSTOMER_CREDIT
• Match and Merge for C_CUSTOMER

OBJECT_TYPE_CODE: Together with OBJECT_FUNCTION_TYPE_CODE, this is a
foreign key to C_REPOS_OBJ_FUNCTION_TYPE. An OBJECT_TYPE_CODE of “P”
indicates a procedure that can potentially be executed by a scheduling tool.

OBJECT_FUNCTION_TYPE_CODE: Indicates the actual procedure type (stage,
load, match, and so on).

PUBLIC_IND: Indicates whether the procedure can be displayed in the Batch
Viewer.

PARAMETER: Describes the parameter list for the procedure. Where specific
ROWID_TABLE values are required for the procedure, these are shown in the
parameter list. Otherwise, the name of the parameter is simply displayed in the
parameter list. An exception is the parameter list for Stage jobs (where
OBJECT_NAME = CMX_CLEANSE.EXE); in this case, the full parameter list is not
shown. For a list of parameters, see “Stage Jobs” on page 131.

VALID_IND: If VALID_IND is not equal to 1, do not execute the procedure. A value
other than 1 means that repository settings that affect the procedure have changed.
This usually applies to changes that affect Stage jobs when the mappings have not
been checked and saved again. To learn more, see “Determining Available
Execution Scripts” on page 107.

Identifiers in C_REPOS_TABLE_OBJECT_V
You use the following identifier values in C_REPOS_TABLE_OBJECT_V to execute
stored procedures.

OBJECT_NAME | OBJECT_DESC | OBJECT_TYPE_CODE | OBJECT_FUNCTION_TYPE_CODE
CMXMM.AUTOLINK | Link data in BaseObjectName | P | I
CMXMA.MATCH_AND_MERGE | Match and Merge for BaseObjectName | P | B
CMXMM.AUTOMERGE | Merge data in BaseObjectName | P | G
CMXMA.EXTERNAL_MATCH | External Match for BaseObjectName | P | E
CMXMM.BUILD_BVT | Generate BVT snapshot for BaseObjectName | P | V
CMXMA.GENERATE_MATCH_TOKENS | Generate Match Tokens for BaseObjectName | P | N
CMXMA.KEY_MATCH | Key Match for BaseObjectName | P | K
CMXLD.LOAD_MASTER | Load from Link BaseObjectName | P | L
CMXMM.MLINK | Manual Link for BaseObjectName | P | O
CMXMM.MUNLINK | Manual Unlink for BaseObjectName | P | Q
CMXMA.MATCH | Match Analyze for BaseObjectName | P | Z
CMXMA.MATCH | Match for BaseObjectName | P | M
CMXMA.MATCH_FOR_DUPS | Match for Duplicate Data for BaseObjectName | P | D
CMXMA.MIGRATE_LINK_STYLE_TO_MERGE_STYLE | Migrate Link Style to Merge Style for BaseObjectName | P | J
CMXMM.MULTI_MERGE | Multi Merge for BaseObjectName | P | P
CMXMA.RESET_LINKS | Reset Links for BaseObjectName | P | W
CMXMA.RESET_MATCH | Reset Match table for BaseObjectName | P | R
CMX_CLEANSE.EXE | Stage for TargetStagingTableName | P | C
CMXMM.UNMERGE | Unmerge for BaseObjectName | P | X

Determining Available Execution Scripts


To determine which batch jobs are available to be executed via stored procedures, run a
query using the standard Siperian Hub view called C_REPOS_TABLE_OBJECT_V,
as shown in the following example:
SELECT *
FROM C_REPOS_TABLE_OBJECT_V
WHERE PUBLIC_IND = 1;

Retrieving Values from C_REPOS_TABLE_OBJECT_V at Execution Time
You can use SQL statements to retrieve values from C_REPOS_TABLE_OBJECT_V
when executing your scripts at run time. The following example code retrieves the
STG_ROWID_TABLE and ROWID_TABLE_OBJECT for cleanse jobs.
SELECT a.rowid_table, a.rowid_table_object
INTO IN_STG_ROWID_TABLE, IN_ROWID_TABLE_OBJECT
FROM c_repos_table_object_v a, c_repos_table b
WHERE a.object_name = 'CMX_CLEANSE.EXE'
AND b.rowid_table = a.rowid_table
AND b.table_name = 'C_HMO_ADDRESS'
AND a.valid_ind = 1;

Running Scripts Asynchronously


By default, the execution scripts run synchronously (IN_RUN_SYNCH = TRUE or
IN_RUN_SYNCH = NULL). To run the execution scripts asynchronously, specify
IN_RUN_SYNCH = FALSE. Note that these Boolean values are case-sensitive and
must be specified in upper-case characters.

Monitoring Job Results and Statistics


This section describes how to monitor the results of batch jobs.

Error Messages and Return Codes


Siperian Hub stored procedures return an error message and return code.

OUT_ERROR_MESSAGE: Error message, if an error occurred.

OUT_RETURN_CODE: Return code. Zero (0) if no errors occurred, or one (1) if an
error occurred.

Error handling code in job execution scripts can look for return codes and trap any
associated error messages.
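For example, a wrapper script that calls a stored procedure might interpret the returned parameters like this (an illustrative Python sketch; the function name check_job_result is hypothetical, not part of Siperian Hub):

```python
def check_job_result(out_return_code, out_error_message):
    """Interpret the parameters returned by a Siperian Hub stored
    procedure: OUT_RETURN_CODE is 0 on success and 1 on error, in
    which case OUT_ERROR_MESSAGE describes the problem."""
    if out_return_code != 0:
        raise RuntimeError("batch job failed: %s" % out_error_message)
    return "ok"

print(check_job_result(0, None))  # ok
```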

Job Execution Status


Siperian Hub stored procedures log their job execution status and statistics in the
Siperian Hub repository.


The following figure shows the repository tables that can be used for monitoring job
results and statistics:

The following provides more information about these repository tables.


Repository Tables Used for Monitoring Job Results and Statistics

C_REPOS_JOB_CONTROL: As soon as a job starts to run, it registers itself in
C_REPOS_JOB_CONTROL with a RUN_STATUS of 2 (Running/Processing). Once the
job completes, its status is updated to one of the following values:
• 0 (Completed Successfully)—Completed without any errors or warnings.
• 1 (Completed with Errors)—Completed, but with some warnings or data
rejections. See the RETURN_CODE for any error code and the STATUS_MESSAGE
for a description of the error or warning.
• 2 (Running/Processing)
• 3 (Failed)—Job did not complete. Corrective action must be taken and the job
must be run again. See the RETURN_CODE for any error code and the
STATUS_MESSAGE for the reason for failure.
• 4 (Incomplete)—The job failed before updating its job status and has been
manually marked as incomplete (by clicking the Set Status to Incomplete button
in the Batch Viewer). Corrective action must be taken and the job must be run
again. RETURN_CODE and STATUS_MESSAGE will not provide any useful
information.

C_REPOS_JOB_METRIC: When a batch job has completed, it registers its statistics
in C_REPOS_JOB_METRIC. There can be multiple statistics for each job. Join to
C_REPOS_JOB_METRIC_TYPE to get a description for each statistic.

C_REPOS_JOB_METRIC_TYPE: Stores the descriptions of the types of metrics that
can be registered in C_REPOS_JOB_METRIC.

C_REPOS_JOB_STATUS_TYPE: Stores the descriptions of the RUN_STATUS values
that can be registered in C_REPOS_JOB_CONTROL.
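A scheduling script that polls C_REPOS_JOB_CONTROL might interpret RUN_STATUS like this (an illustrative Python sketch; the helper job_needs_rerun is hypothetical, not part of Siperian Hub):

```python
# RUN_STATUS values registered in C_REPOS_JOB_CONTROL.
RUN_STATUS = {
    0: "Completed Successfully",
    1: "Completed with Errors",
    2: "Running / Processing",
    3: "Failed",
    4: "Incomplete",
}

def job_needs_rerun(run_status):
    """Corrective action and a rerun are required for Failed (3)
    and Incomplete (4) jobs."""
    return run_status in (3, 4)

print(RUN_STATUS[2])       # Running / Processing
print(job_needs_rerun(3))  # True
```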

Job Scheduling Reference


This section provides a reference for the stored procedures that represent Siperian Hub
batch jobs. Siperian Hub provides these stored procedures, in compiled form, for each
Operational Record Store (ORS), for Oracle and DB2 databases. You can use any job
scheduling software (such as Tivoli, CA Unicenter, and so on) to execute these stored
procedures.

Alphabetical List of Jobs

Autolink Jobs: Automatically links records that have qualified for autolinking
during the match process and are flagged for autolinking (Automerge_ind = 1).
Used with link-style base objects only.

Auto Match and Merge Jobs: Executes a continual cycle of a Match job, followed
by an Automerge job, until there are no more records to match, or until the size of
the manual merge queue exceeds the configured threshold. Used with merge-style
base objects only.

Automerge Jobs: Automatically merges records that have qualified for automerging
during the match process and are flagged for automerging (Automerge_ind = 1).
Used with merge-style base objects only.

BVT Snapshot Jobs: Generates a snapshot of the best version of the truth (BVT)
for a base object. Used with link-style base objects only.

Generate Match Token Jobs: Prepares data for matching by generating match
tokens according to the current match settings. Match tokens are strings that
encode the columns used to identify candidates for matching.

Key Match Jobs: Matches records from two or more sources when these sources
use the same primary key. Compares new records to each other and to existing
records, and identifies potential matches based on the comparison of source record
keys as defined by the match rules.

Load Jobs: Copies records from a staging table to the corresponding target table in
the Hub Store (a base object or dependent object). During the load process, applies
the current trust and validation rules to the records.

Manual Link Jobs: Shows logs for records that have been manually linked in the
Merge Manager tool. Used with link-style base objects only.

Manual Unlink Jobs: Shows logs for records that have been manually unlinked in
the Merge Manager tool. Used with link-style base objects only.

Match Jobs: Compares new records to each other and to existing records, and
identifies potential matches based on the current match settings.

Match Analyze Jobs: Conducts a search to gather match statistics but does not
actually perform the match process. If areas of data with the potential for huge
match requirements are discovered, Siperian Hub moves the records to a hold
status, which allows a data steward to review the data manually before proceeding
with the match process.

Match for Duplicate Data Jobs: For data with a high percentage of duplicate
records, compares new records to each other and to existing records, and identifies
exact duplicates. The maximum number of exact duplicates is based on the
Duplicate Match Threshold setting for the base object.

Stage Jobs: Copies records from a landing table into a staging table. During
execution, cleanses the data according to the current cleanse settings.

Unmerge Jobs: Updates metadata for base objects. Used after a base object has
been loaded but not yet merged, and subsequent trust configuration changes (such
as enabling trust) have been made to columns in that base object. This job must be
run before merging data for this base object.

Autolink Jobs
Autolink jobs automatically link records that have qualified for autolinking during the match
process and are flagged for autolinking (Automerge_ind = 1). Autolink jobs are used with
link-style base objects only. To learn more, see the Siperian Hub Administrator’s Guide.

Identifiers for Executing Autolink Jobs

To learn about the identifiers used to execute the stored procedure associated with this
batch job, see “Identifiers in C_REPOS_TABLE_OBJECT_V” on page 106.

Dependencies for Autolink Jobs

Each Autolink job is dependent on the successful completion of the match process,
and the queuing of records for merge.


Successful Completion of Autolink Jobs

Autolink jobs must complete with a RUN_STATUS of 0 (Completed Successfully) or 1 (Completed with Errors) to be considered successful.
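This success criterion recurs for most batch jobs in this chapter. As an illustrative sketch only (not a Siperian API), a scheduler could gate downstream jobs on a check like this:

```python
# Illustrative sketch, not a Siperian API: the RUN_STATUS success
# criterion used throughout this chapter, expressed as a check.
COMPLETED_SUCCESSFULLY = 0   # RUN_STATUS 0
COMPLETED_WITH_ERRORS = 1    # RUN_STATUS 1

def job_succeeded(run_status: int) -> bool:
    """A batch job counts as successful with RUN_STATUS 0 or 1;
    any other status is treated as a failure."""
    return run_status in (COMPLETED_SUCCESSFULLY, COMPLETED_WITH_ERRORS)
```

A custom scheduler could call such a check before starting jobs that depend on the one just finished.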

Oracle Implementations

Stored Procedure Definition for Autolink Jobs—Oracle


PROCEDURE autolink (
in_rowid_table IN cmxlb.cmx_rowid
,in_user_name IN cmxlb.cmx_user_name
,out_error_message OUT cmxlb.cmx_message
,out_return_code OUT int
)

Sample Job Execution Script for Autolink Jobs—Oracle


DECLARE
IN_ROWID_TABLE CHAR(14);
IN_USER_NAME VARCHAR2(200);
OUT_ERROR_MSG VARCHAR2(2000);
OUT_RETURN_CODE NUMBER;
BEGIN
IN_ROWID_TABLE := NULL;
IN_USER_NAME := NULL;
OUT_ERROR_MSG := NULL;
OUT_RETURN_CODE := NULL;
CMXMM.AUTOLINK ( IN_ROWID_TABLE, IN_USER_NAME, OUT_ERROR_MSG, OUT_RETURN_CODE );
COMMIT;
END;

Auto Match and Merge Jobs


Auto Match and Merge batch jobs execute a continual cycle of a Match job followed by an Automerge job, until there are no more records to match, or until the size of the manual merge queue exceeds the configured threshold. When executed, the CMXMA.MATCH_AND_MERGE stored procedure loops through Match and Automerge jobs until there are no more records to match, or until the manual merge queue size limit is reached. Auto Match and Merge jobs are used with merge-style base objects only. To learn more, see the Siperian Hub Administrator’s Guide.
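The loop can be sketched as follows. This is an illustration of the described control flow, not the actual CMXMA.MATCH_AND_MERGE implementation; the callables are hypothetical stand-ins for the Match job, the Automerge job, and a query of the manual merge queue size.

```python
# Sketch of the Auto Match and Merge control flow described above.
def match_and_merge(find_matches, automerge, manual_queue_size, queue_threshold):
    while True:
        queued = find_matches()          # Match job: how many records queued?
        if queued == 0:
            break                        # no more records to match
        if manual_queue_size() > queue_threshold:
            break                        # manual merge queue size limit reached
        automerge()                      # Automerge job for the queued records

# Toy run: matches found on successive passes, then none.
pending = [4, 2, 0]
merged = []
match_and_merge(find_matches=lambda: pending.pop(0),
                automerge=lambda: merged.append('automerge'),
                manual_queue_size=lambda: 0,
                queue_threshold=10)
# merged now records the Automerge passes run before the match queue emptied
```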

Identifiers for Executing Auto Match and Merge Jobs

To learn about the identifiers used to execute the stored procedure associated with this
batch job, see “Identifiers in C_REPOS_TABLE_OBJECT_V” on page 106.

Dependencies for Auto Match and Merge Jobs

The Auto Match and Merge jobs for a target base object can either be run on
successful completion of each Load job, or on successful completion of all Load jobs
for the object.

Successful Completion of Auto Match and Merge Jobs

Auto Match and Merge jobs must complete with a RUN_STATUS of 0 (Completed
Successfully) or 1 (Completed with Errors) to be considered successful.

Oracle Implementations

Stored Procedure Definition for Auto Match and Merge Jobs—Oracle


PROCEDURE CMXMA.MATCH_AND_MERGE (
IN_ROWID_TABLE CHAR(14);--Rowid of a Table.
IN_USER_NAME VARCHAR2(200);--User Name.
OUT_ERROR_MSG VARCHAR2(2000); --Error Message, if any.
OUT_RETURN_CODE NUMBER; --Return Code. (If no errors, 0 is returned)
)

Sample Job Execution Script for Auto Match and Merge Jobs—Oracle
DECLARE
IN_ROWID_TABLE CHAR(14);
IN_USER_NAME VARCHAR2(200);
OUT_ERROR_MSG VARCHAR2(2000);
OUT_RETURN_CODE NUMBER;

BEGIN

IN_ROWID_TABLE := NULL;
IN_USER_NAME := NULL;
OUT_ERROR_MSG := NULL;
OUT_RETURN_CODE := NULL;

CMXMA.MATCH_AND_MERGE ( IN_ROWID_TABLE, IN_USER_NAME, OUT_ERROR_MSG, OUT_RETURN_CODE );
COMMIT;
END;

Automerge Jobs
Automerge jobs automatically merge records that have qualified for automerging
during the match process and are flagged for automerging (Automerge_ind = 1).
Automerge jobs are used with merge-style base objects only. To learn more, see the
Siperian Hub Administrator’s Guide.

Identifiers for Executing Automerge Jobs

To learn about the identifiers used to execute the stored procedure associated with this
batch job, see “Identifiers in C_REPOS_TABLE_OBJECT_V” on page 106.

Dependencies for Automerge Jobs

Each Automerge job is dependent on the successful completion of the match process,
and the queuing of records for merge.

Successful Completion of Automerge Jobs

Automerge jobs must complete with a RUN_STATUS of 0 (Completed Successfully) or 1 (Completed with Errors) to be considered successful.

Oracle Implementations

Stored Procedure Definition for Automerge Jobs—Oracle


PROCEDURE CMXMM.AUTOMERGE (
IN_ROWID_TABLE CHAR(14); --Rowid of a Table.
IN_USER_NAME VARCHAR2(200); --User Name.
OUT_ERROR_MESSAGE VARCHAR2(2000); --Error Message, if any.
OUT_RETURN_CODE NUMBER; --Return Code. (If no errors, 0 is returned)
)

Sample Job Execution Script for Automerge Jobs—Oracle


DECLARE
IN_ROWID_TABLE CHAR(14);
IN_USER_NAME VARCHAR2(200);
OUT_ERROR_MESSAGE VARCHAR2(2000);
OUT_RETURN_CODE NUMBER;

BEGIN
IN_ROWID_TABLE := NULL;
IN_USER_NAME := NULL;
OUT_ERROR_MESSAGE := NULL;
OUT_RETURN_CODE := NULL;

CMXMM.AUTOMERGE ( IN_ROWID_TABLE, IN_USER_NAME, OUT_ERROR_MESSAGE, OUT_RETURN_CODE );
COMMIT;
END;

BVT Snapshot Jobs


The BVT Snapshot stored procedure generates a snapshot of the best version of the truth (BVT) for a base object. It supports calculating the BVT for one link group (the group of records linked to a single group). BVT Snapshot jobs are used with link-style base objects only. To learn more, see the Siperian Hub Administrator’s Guide.

When executing the BVT Snapshot stored procedure:

• IN_GROUP_ID_LIST is a list of group_ids delimited by ~ (such as ‘1~2~’).
• OUT_BVT contains the BVT values for the base object table in the following format: ‘col1~col2~|val1~val2’
• OUT_LINEAGE contains the BVT values for the CTL table in the following format: ‘col1~col2~|val1~val2’
• For delimited strings, the escape character is ‘\’.
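The delimited formats above can be unpacked with ordinary string handling. The sketch below is an illustrative assumption, not a Siperian-provided utility: it splits on the ~ delimiter while honoring the '\' escape character, then pairs column names with values.

```python
# Illustrative parser for the 'col1~col2~|val1~val2' format described
# above, with '\' as the escape character. Not a Siperian utility.
def split_delimited(s, delim='~', esc='\\'):
    parts, cur, i = [], '', 0
    while i < len(s):
        c = s[i]
        if c == esc and i + 1 < len(s):
            cur += s[i + 1]          # keep the escaped character literally
            i += 2
        elif c == delim:
            parts.append(cur)
            cur = ''
            i += 1
        else:
            cur += c
            i += 1
    parts.append(cur)
    return parts

def parse_bvt(out_bvt):
    cols_part, vals_part = out_bvt.split('|', 1)
    cols = [c for c in split_delimited(cols_part) if c]  # drop trailing empty
    vals = split_delimited(vals_part)
    return dict(zip(cols, vals))

snapshot = parse_bvt('FIRST_NAME~LAST_NAME~|JANE~DOE')
```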


Identifiers for Executing BVT Snapshot Jobs

To learn about the identifiers used to execute the stored procedure associated with this
batch job, see “Identifiers in C_REPOS_TABLE_OBJECT_V” on page 106.

Dependencies for BVT Snapshot Jobs

Each BVT Snapshot job is dependent on the successful completion of the Autolink job
for this base object.

Successful Completion of BVT Snapshot Jobs

BVT Snapshot jobs must complete with a RUN_STATUS of 0 (Completed Successfully) or 1 (Completed with Errors) to be considered successful.

Oracle Implementations

Stored Procedure Definition for BVT Snapshot Jobs—Oracle


PROCEDURE build_bvt (
in_rowid_table cmxlb.cmx_rowid
,in_group_id_list cmxlb.cmx_big_str
,in_user_name cmxlb.cmx_user_name
,out_bvt OUT cmxlb.cmx_big_str
,out_lineage OUT cmxlb.cmx_big_str
,out_bvt_count OUT int
,out_error_message OUT cmxlb.cmx_message
,out_return_code OUT int
)

Sample Job Execution Script for BVT Snapshot Jobs—Oracle


DECLARE
IN_ROWID_TABLE CHAR(14);
IN_GROUP_ID CHAR(14);
IN_USER_NAME VARCHAR2(200);
OUT_BVT VARCHAR2(2000);
OUT_LINEAGE VARCHAR2(2000);
OUT_BVT_COUNT NUMBER;


OUT_ERROR_MESSAGE VARCHAR2(200);
OUT_RETURN_CODE NUMBER;

BEGIN
IN_ROWID_TABLE := 'SVR1.7S3 ';
IN_GROUP_ID := '1 ';
IN_USER_NAME := NULL;
OUT_LINEAGE :=NULL;
OUT_BVT_COUNT :=NULL;
OUT_BVT := NULL;
OUT_ERROR_MESSAGE := NULL;
OUT_RETURN_CODE := NULL;

CMXMM.build_BVT ( IN_ROWID_TABLE, IN_GROUP_ID, IN_USER_NAME, OUT_BVT,
OUT_LINEAGE, OUT_BVT_COUNT, OUT_ERROR_MESSAGE, OUT_RETURN_CODE );
COMMIT;
DBMS_OUTPUT.PUT_LINE ( 'OUT_BVT= ' || substr(OUT_BVT,1,2000) );
DBMS_OUTPUT.PUT_LINE ( 'OUT_LINEAGE= ' || substr(OUT_LINEAGE,1,2000) );
DBMS_OUTPUT.PUT_LINE ( 'OUT_BVT_COUNT= ' || substr(OUT_BVT_COUNT,1,200) );
DBMS_OUTPUT.PUT_LINE ( 'OUT_ERROR_MESSAGE= ' || substr(OUT_ERROR_MESSAGE,1,200) );
DBMS_OUTPUT.PUT_LINE ( 'OUT_RETURN_CODE= ' || substr(OUT_RETURN_CODE,1,200) );
END;

Generate Match Token Jobs


Generate Match Tokens jobs prepare data for matching by generating match tokens
according to the current match settings. Match tokens are strings that encode the
columns used to identify candidates for matching. To learn more, see the Siperian Hub
Administrator’s Guide.

Schedule Generate Match Tokens jobs if you run the load process without data tokenization, or if the match process failed during tokenization. The Generate Match Tokens job generates the match tokens for the entire base object (when IN_FULL_RESTRIP_IND is set to 1), or only for the records that must be processed.
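As a loose illustration of the concept only (the actual tokenization algorithm is internal to Siperian Hub and is not shown here), a match token can be thought of as a normalized encoding of the match columns, so that near-identical records yield identical tokens and become match candidates:

```python
# Toy illustration of the *idea* of a match token: a normalized string
# built from the match columns. This is NOT the Siperian algorithm.
def toy_match_token(*match_columns):
    # Normalize case and surrounding whitespace before encoding.
    return '|'.join(c.strip().upper() for c in match_columns)

a = toy_match_token('  Smith ', 'London')   # 'SMITH|LONDON'
b = toy_match_token('smith', 'LONDON')      # same token -> candidate pair
```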


Note: Check (select) the Re-generate All Match Tokens check box in the Batch Viewer
to populate the IN_FULL_RESTRIP_IND parameter.

Identifiers for Executing Generate Match Token Jobs

To learn about the identifiers used to execute the stored procedure associated with this
batch job, see “Identifiers in C_REPOS_TABLE_OBJECT_V” on page 106.

Dependencies for Generate Match Token Jobs

Each Generate Match Tokens job is dependent on the successful completion of the
Load job responsible for loading data into the base object.

Successful Completion of Generate Match Token Jobs

Generate Match Tokens jobs must complete with a RUN_STATUS of 0 (Completed Successfully).

Oracle Implementations

Stored Procedure Definition for Generate Match Token Jobs—Oracle


PROCEDURE GENERATE_MATCH_TOKENS (
IN_ROWID_TABLE CHAR(14); --Rowid of a Table.
IN_USER_NAME VARCHAR2(200); --User Name.
OUT_ERROR_MSG VARCHAR2(2000); --Error Message, if any.
OUT_RETURN_CODE NUMBER; --Return Code. (If no errors, 0 is returned)
IN_FULL_RESTRIP_IND NUMBER; --Default 0; retokenize entire table if set to 1 (strip_truncate_insert)
)

Sample Job Execution Script for Generate Match Token Jobs—Oracle


DECLARE
IN_ROWID_TABLE CHAR(14);
IN_USER_NAME VARCHAR2(200);
OUT_ERROR_MSG VARCHAR2(2000);
OUT_RETURN_CODE NUMBER;
IN_FULL_RESTRIP_IND NUMBER;


BEGIN
IN_ROWID_TABLE := NULL;
IN_USER_NAME := NULL;
OUT_ERROR_MSG := NULL;
OUT_RETURN_CODE := NULL;
IN_FULL_RESTRIP_IND := NULL;
CMXMA.GENERATE_MATCH_TOKENS ( IN_ROWID_TABLE, IN_USER_NAME, OUT_ERROR_MSG, OUT_RETURN_CODE, IN_FULL_RESTRIP_IND );
COMMIT;
END;

Key Match Jobs


Key Match jobs are used to match records from two or more sources when these sources use the same primary key. Key Match jobs compare new records to each other and to existing records, and identify potential matches based on the comparison of source record keys, as defined by the match rules. To learn more, see the Siperian Hub Administrator’s Guide.

Identifiers for Executing Key Match Jobs

To learn about the identifiers used to execute the stored procedure associated with this
batch job, see “Identifiers in C_REPOS_TABLE_OBJECT_V” on page 106.

Dependencies for Key Match Jobs

Key Match jobs are dependent on the successful completion of the Load job
responsible for loading data into the base object. The Key Match job cannot have been
run after any changes were made to the data.

Successful Completion of Key Match Jobs

Key Match jobs must complete with a RUN_STATUS of 0 (Completed Successfully).


Oracle Implementations

Stored Procedure Definition for Key Match Jobs—Oracle


PROCEDURE KEY_MATCH (
IN_ROWID_TABLE CHAR(14); --Rowid of a Table.
IN_USER_NAME VARCHAR2(200); --User Name.
OUT_ERROR_MSG VARCHAR2(2000); --Error Message, if any.
OUT_RETURN_CODE NUMBER; --Return Code. (If no errors, 0 is returned)
)

Sample Job Execution Script for Key Match Jobs—Oracle


DECLARE
IN_ROWID_TABLE VARCHAR2(200);
IN_USER_NAME VARCHAR2(200);
OUT_ERROR_MESSAGE VARCHAR2(200);
OUT_RETURN_CODE NUMBER;
BEGIN
IN_ROWID_TABLE := NULL;
IN_USER_NAME := 'myusername';
OUT_ERROR_MESSAGE := NULL;
OUT_RETURN_CODE := NULL;

select rowid_table INTO IN_ROWID_TABLE
from c_repos_table
where table_name = 'C_ADDRESS';

DBMS_OUTPUT.Put_Line(' Row id table = ' || IN_ROWID_TABLE);


CMXMA.KEY_MATCH ( IN_ROWID_TABLE, IN_USER_NAME, OUT_ERROR_MESSAGE,
OUT_RETURN_CODE);
DBMS_OUTPUT.Put_Line('OUT_ERROR_MESSAGE = ' || OUT_ERROR_MESSAGE);
DBMS_OUTPUT.Put_Line('OUT_RETURN_CODE = ' || TO_CHAR(OUT_RETURN_CODE));
COMMIT;
END;

Load Jobs
Load jobs move data from staging tables to the final target objects, and apply any trust
and validation rules where appropriate. To learn more about Load jobs and the load
process, see the Siperian Hub Administrator’s Guide.


Identifiers for Executing Load Jobs

To learn about the identifiers used to execute the stored procedure associated with this
batch job, see “Identifiers in C_REPOS_TABLE_OBJECT_V” on page 106.

Dependencies for Load Jobs

Each Load job is dependent on the success of the Stage job that precedes it.
In addition, each Load job is governed by the demands of referential integrity
constraints and is dependent on the successful completion of all other Load jobs
responsible for populating tables referenced by the table that is the target of the load.

For base objects: Run the loads for parent tables before the loads for child tables.
For dependent objects: Run the loads for all referenced base objects before the load for the dependent object.
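These rules amount to a topological ordering over the table-reference graph: a Load job runs only after the Load jobs for every table it references. A sketch of that ordering, using hypothetical table names:

```python
# Sketch of the load-ordering rule above as a topological sort.
# The table names are hypothetical examples, not from the source.
from graphlib import TopologicalSorter

# child or dependent table -> the tables it references
references = {
    'C_ADDRESS': {'C_PARTY'},       # child base object of C_PARTY
    'C_PARTY_GROUP': {'C_PARTY'},   # dependent object referencing C_PARTY
    'C_PARTY': set(),               # parent base object, loaded first
}
load_order = list(TopologicalSorter(references).static_order())
# C_PARTY appears before its child and dependent tables
```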

Successful Completion of Load Jobs

A Load job must complete with a RUN_STATUS of 0 (Completed Successfully) or 1 (Completed with Errors) to be considered successful. The Auto Match and Merge jobs for a target base object can either be run on successful completion of each Load job, or on successful completion of all Load jobs for the base object.

Oracle Implementations

Stored Procedure Definition for Load Jobs—Oracle


PROCEDURE CMXLD.LOAD_MASTER (
IN_STG_ROWID_TABLE CHAR(14); --Rowid of Staging Table
IN_USER_NAME VARCHAR2(200); --Database User Name
OUT_ERROR_MSG VARCHAR2(2000); --Error Message, if any
OUT_RETURN_CODE NUMBER; --Return Code. (If no errors, 0 is returned)
IN_FORCE_UPDATE_IND NUMBER; --Forced Update value. Default 0; 1 for forced update.
)


Sample Job Execution Script for Load Jobs—Oracle


DECLARE
IN_STG_ROWID_TABLE CHAR(14);
IN_USER_NAME VARCHAR2(200);
OUT_ERROR_MSG VARCHAR2(2000);
OUT_RETURN_CODE NUMBER;
IN_FORCE_UPDATE_IND NUMBER;

BEGIN
IN_STG_ROWID_TABLE := NULL;
IN_USER_NAME := NULL;
OUT_ERROR_MSG := NULL;
OUT_RETURN_CODE := NULL;
IN_FORCE_UPDATE_IND := NULL;

CMXLD.LOAD_MASTER ( IN_STG_ROWID_TABLE, IN_USER_NAME, OUT_ERROR_MSG, OUT_RETURN_CODE, IN_FORCE_UPDATE_IND );
COMMIT;
END;

Manual Link Jobs


Manual Link jobs execute manual linking in the Merge Manager tool. Manual Link jobs are used with link-style base objects only. Results are stored in a _LINK table. To learn more, see the Siperian Hub Administrator’s Guide.

Identifiers for Executing Manual Link Jobs

To learn about the identifiers used to execute the stored procedure associated with this
batch job, see “Identifiers in C_REPOS_TABLE_OBJECT_V” on page 106.

Dependencies for Manual Link Jobs

Each Manual Link job is dependent on the successful completion of the match process
for this base object.


Successful Completion of Manual Link Jobs

Manual Link jobs must complete with a RUN_STATUS of 0 (Completed Successfully) or 1 (Completed with Errors) to be considered successful.

When executing the Manual Link stored procedure:


• IN_MEMBER_ROWID_LIST contains a list of rowid_objects in the following format, which uses the ~ delimiter: rowid_object1~rowid_object2~rowid_object~
• Insert link records for rowid_objects in the IN_MEMBER_ROWID_LIST to the group of IN_GROUP_ID in the base object’s link table.
• Only one active link record (UNLINK_IND=0) is allowed for each rowid_object.

Oracle Implementations

Stored Procedure Definition for Manual Link Jobs—Oracle


PROCEDURE mlink (
in_rowid_table cmxlb.cmx_rowid
,in_member_rowid_list cmxlb.cmx_big_str
,in_group_id cmxlb.cmx_rowid
,in_rowid_match_rule cmxlb.cmx_rowid
,in_automerge_ind int
,in_interaction_id int
,in_user_name cmxlb.cmx_user_name
,out_error_message OUT cmxlb.cmx_message
,out_return_code OUT int
)

Sample Job Execution Script for Manual Link Jobs—Oracle


DECLARE
IN_ROWID_TABLE CHAR(14);
IN_MEMBER_ROWID_LIST VARCHAR2(2000);
IN_GROUP_ID CHAR(14);
IN_ROWID_MATCH_RULE CHAR(14);
IN_AUTOMERGE_IND NUMBER;
IN_INTERACTION_ID NUMBER;
IN_USER_NAME VARCHAR2(200);
OUT_ERROR_MESSAGE VARCHAR2(200);
OUT_RETURN_CODE NUMBER;

BEGIN
IN_ROWID_TABLE := 'SVR1.ELV ';
IN_MEMBER_ROWID_LIST := '11 ';
IN_GROUP_ID := '1 ';
IN_ROWID_MATCH_RULE := NULL;
IN_AUTOMERGE_IND := NULL;
IN_INTERACTION_ID := NULL;
IN_USER_NAME := 'JW';
OUT_ERROR_MESSAGE := NULL;
OUT_RETURN_CODE := NULL;

CMXMM.MLINK ( IN_ROWID_TABLE, IN_MEMBER_ROWID_LIST, IN_GROUP_ID,
IN_ROWID_MATCH_RULE, IN_AUTOMERGE_IND, IN_INTERACTION_ID, IN_USER_NAME,
OUT_ERROR_MESSAGE, OUT_RETURN_CODE );
COMMIT;
DBMS_OUTPUT.PUT_LINE ( 'OUT_ERROR_MESSAGE= ' || substr(OUT_ERROR_MESSAGE,1,200) );
DBMS_OUTPUT.PUT_LINE ( 'OUT_RETURN_CODE= ' || substr(OUT_RETURN_CODE,1,200) );
END;

Manual Unlink Jobs


Manual Unlink jobs execute manual unlinking of records that were previously linked manually in the Merge Manager tool. Manual Unlink jobs are used with link-style base objects only. Manual Unlink jobs ungroup the selected base object records (group members) from the target group, and update the corresponding linkage information in the LINK table (setting the unlink_ind value to 1). Manual Unlink jobs also unlink all the group members of the target group if the incoming in_member_rowid_list parameter is NULL. To learn more, see the Siperian Hub Administrator’s Guide.

Identifiers for Executing Manual Unlink Jobs

To learn about the identifiers used to execute the stored procedure associated with this
batch job, see “Identifiers in C_REPOS_TABLE_OBJECT_V” on page 106.

When executing the Manual Unlink stored procedure:


• Active link records of rowid_objects in IN_MEMBER_ROWID_LIST are set to
be inactive (UNLINK_IND=1).
• Active link records of rowid_objects belonging to group IN_GROUP_ID are set
to be inactive if the IN_MEMBER_ROWID_LIST is not passed in.

• IN_MEMBER_ROWID_LIST contains a list of rowid_objects in the following format, which uses the ~ delimiter: rowid_object1~rowid_object2~rowid_object~

Dependencies for Manual Unlink Jobs

Each Manual Unlink job is dependent on the successful completion of a previously-run Manual Link job.

Successful Completion of Manual Unlink Jobs

Manual Unlink jobs must complete with a RUN_STATUS of 0 (Completed Successfully) or 1 (Completed with Errors) to be considered successful.

Stored Procedure Definition for Manual Unlink Jobs—Oracle


PROCEDURE munlink (
in_rowid_table cmxlb.cmx_rowid
,in_member_rowid_list cmxlb.cmx_big_str -- delimited by '~'
,in_group_id cmxlb.cmx_rowid
,in_interaction_id int
,in_user_name cmxlb.cmx_user_name
,out_error_message OUT cmxlb.cmx_message
,out_return_code OUT int
)

Sample Job Execution Script for Manual Unlink Jobs—Oracle


DECLARE
IN_ROWID_TABLE CHAR(14);
IN_MEMBER_ROWID_LIST VARCHAR2(2000);
IN_GROUP_ID CHAR(14);
IN_INTERACTION_ID NUMBER;
IN_USER_NAME VARCHAR2(200);
OUT_ERROR_MESSAGE VARCHAR2(200);
OUT_RETURN_CODE NUMBER;

BEGIN
IN_ROWID_TABLE := 'SVR1.ELV ';
IN_MEMBER_ROWID_LIST := '11 ';
IN_GROUP_ID := '1 ';
IN_INTERACTION_ID := NULL;
IN_USER_NAME := NULL;
OUT_ERROR_MESSAGE := NULL;
OUT_RETURN_CODE := NULL;

CMXMM.MUNLINK ( IN_ROWID_TABLE, IN_MEMBER_ROWID_LIST, IN_GROUP_ID,
IN_INTERACTION_ID, IN_USER_NAME, OUT_ERROR_MESSAGE, OUT_RETURN_CODE );
COMMIT;
DBMS_OUTPUT.PUT_LINE ( 'OUT_ERROR_MESSAGE= ' || substr(OUT_ERROR_MESSAGE,1,200) );
DBMS_OUTPUT.PUT_LINE ( 'OUT_RETURN_CODE= ' || substr(OUT_RETURN_CODE,1,200) );
END;

Match Jobs
Match jobs check the specified match condition for the rows of a base object table and
then queue the matched rows for either automerge or manual merge. To learn more
about Match jobs and the match process, see the Siperian Hub Administrator’s Guide.

Identifiers for Executing Match Jobs

To learn about the identifiers used to execute the stored procedure associated with this
batch job, see “Identifiers in C_REPOS_TABLE_OBJECT_V” on page 106.

Dependencies for Match Jobs

Each Match job is dependent on new / updated records in the base object that have
been tokenized and are thus queued for matching. For parent base objects that have
children, the Match job is also dependent on the successful completion of the data
tokenization jobs for all child tables, which in turn is dependent on successful Load
jobs for the child tables.

Successful Completion of Match Jobs

Match jobs must complete with a RUN_STATUS of 0 (Completed Successfully) or 1


(Completed with Errors) to be considered successful.


Oracle Implementations

Stored Procedure for Match Jobs—Oracle


PROCEDURE CMXMA.MATCH (
IN_ROWID_TABLE CHAR(14); --Rowid of a Table.
IN_USER_NAME VARCHAR2(200); --User Name.
OUT_ERROR_MSG VARCHAR2(2000); --Error Message, if any.
OUT_RETURN_CODE NUMBER; --Return Code. (If no errors, 0 is returned)
IN_VALIDATE_TABLE_NAME VARCHAR2(200); --Validate Table Name
IN_MATCH_ANALYZE_IND NUMBER; --Match Analyze to Check for Match Data.
)

Sample Job Execution Script for Match Jobs—Oracle


DECLARE
IN_ROWID_TABLE CHAR(14);
IN_USER_NAME VARCHAR2(200);
OUT_ERROR_MSG VARCHAR2(2000);
OUT_RETURN_CODE NUMBER;
IN_VALIDATE_TABLE_NAME VARCHAR2(200);
IN_MATCH_ANALYZE_IND NUMBER;

BEGIN
IN_ROWID_TABLE := NULL;
IN_USER_NAME := NULL;
OUT_ERROR_MSG := NULL;
OUT_RETURN_CODE := NULL;
IN_VALIDATE_TABLE_NAME := NULL;
IN_MATCH_ANALYZE_IND := 0;

CMXMA.MATCH ( IN_ROWID_TABLE, IN_USER_NAME, OUT_ERROR_MSG, OUT_RETURN_CODE, IN_VALIDATE_TABLE_NAME, IN_MATCH_ANALYZE_IND );
COMMIT;
END;

Match Analyze Jobs


Match Analyze jobs perform a search to gather metrics about matching without
conducting any actual matching. Match Analyze jobs are typically used to tune match
rules, which is described in the Siperian Hub Implementer’s Guide. To learn more, see the
Siperian Hub Administrator’s Guide.


Identifiers for Executing Match Analyze Jobs

To learn about the identifiers used to execute the stored procedure associated with this
batch job, see “Identifiers in C_REPOS_TABLE_OBJECT_V” on page 106.

Dependencies for Match Analyze Jobs

Each Match Analyze job is dependent on new / updated records in the base object that
have been tokenized and are thus queued for matching. For parent base objects, the
Match Analyze job is also dependent on the successful completion of the data
tokenization jobs for all child tables, which in turn is dependent on successful Load
jobs for the child tables.

Successful Completion of Match Analyze Jobs

Match Analyze jobs must complete with a RUN_STATUS of 0 (Completed Successfully) or 1 (Completed with Errors) to be considered successful.

Oracle Implementations

Stored Procedure for Match Analyze Jobs—Oracle


PROCEDURE CMXMA.MATCH (
IN_ROWID_TABLE CHAR(14); --Rowid of a Table.
IN_USER_NAME VARCHAR2(200); --User Name.
OUT_ERROR_MSG VARCHAR2(2000); --Error Message, if any.
OUT_RETURN_CODE NUMBER; --Return Code. (If no errors, 0 is returned)
IN_VALIDATE_TABLE_NAME VARCHAR2(200); --Validate Table Name
IN_MATCH_ANALYZE_IND NUMBER; --Match Analyze to Check for Match Data.
)

Sample Job Execution Script for Match Analyze Jobs—Oracle


DECLARE
IN_ROWID_TABLE CHAR(14);
IN_USER_NAME VARCHAR2(200);
OUT_ERROR_MSG VARCHAR2(2000);
OUT_RETURN_CODE NUMBER;


IN_VALIDATE_TABLE_NAME VARCHAR2(200);
IN_MATCH_ANALYZE_IND NUMBER;

BEGIN
IN_ROWID_TABLE := NULL;
IN_USER_NAME := NULL;
OUT_ERROR_MSG := NULL;
OUT_RETURN_CODE := NULL;
IN_VALIDATE_TABLE_NAME := NULL;
IN_MATCH_ANALYZE_IND := 1;

CMXMA.MATCH ( IN_ROWID_TABLE, IN_USER_NAME, OUT_ERROR_MSG, OUT_RETURN_CODE, IN_VALIDATE_TABLE_NAME, IN_MATCH_ANALYZE_IND );
COMMIT;
END;

Match for Duplicate Data Jobs


A Match for Duplicate Data job searches for exact duplicates and treats them as matched. Use it to manually run the Match for Duplicate Data process when you want to use your own rule as the match-for-duplicates criteria instead of all the columns in the base object. The maximum number of exact duplicates is based on the base object columns defined in the Duplicate Match Threshold property in the Schema Manager for each base object. To learn more, see the Siperian Hub Administrator’s Guide.

Identifiers for Executing Match for Duplicate Data Jobs

To learn about the identifiers used to execute the stored procedure associated with this
batch job, see “Identifiers in C_REPOS_TABLE_OBJECT_V” on page 106.

Dependencies for Match for Duplicate Data Jobs

Match for Duplicate Data jobs require the existence of unconsolidated data in the base
object.

Successful Completion of Match for Duplicate Data Jobs

Match for Duplicate Data jobs must complete with a RUN_STATUS of 0 (Completed
Successfully).


Oracle Implementations

Stored Procedure Definition for Match for Duplicate Data Jobs—Oracle

PROCEDURE MATCH_FOR_DUPS (
IN_ROWID_TABLE CHAR(14); --Rowid of a Table.
IN_USER_NAME VARCHAR2(200); --User Name.
OUT_ERROR_MSG VARCHAR2(2000); --Error Message, if any.
OUT_RETURN_CODE INT; --Return Code. (If no errors, 0 is returned)
)

Sample Job Execution Script for Match for Duplicate Data Jobs—Oracle
DECLARE
IN_ROWID_TABLE CHAR(14);
IN_USER_NAME VARCHAR2(200);
OUT_ERROR_MSG VARCHAR2(2000);
OUT_RETURN_CODE NUMBER;
BEGIN
IN_ROWID_TABLE := NULL;
IN_USER_NAME := NULL;
OUT_ERROR_MSG := NULL;
OUT_RETURN_CODE := NULL;
CMXMA.MATCH_FOR_DUPS ( IN_ROWID_TABLE, IN_USER_NAME, OUT_ERROR_MSG, OUT_RETURN_CODE );
COMMIT;
END;

Stage Jobs
Stage jobs copy records from a landing to a staging table. During execution, Stage jobs
optionally cleanse data according to the current cleanse settings. To learn more about
Stage jobs and the stage process, see the Siperian Hub Administrator’s Guide.

Identifiers for Executing Stage Jobs

To learn about the identifiers used to execute the stored procedure associated with this
batch job, see “Identifiers in C_REPOS_TABLE_OBJECT_V” on page 106.


Dependencies for Stage Jobs

Each Stage job is dependent on the successful completion of the ETL process
responsible for loading the Landing table used by the Stage job. There are no
dependencies between Stage jobs.

Successful Completion of Stage Jobs

A Stage job must complete with a RUN_STATUS of 0 (Completed Successfully) or 1 (Completed with Errors) to be considered successful. On successful completion of a Stage job, the Load job for the target staging table can be run, provided that all other dependencies for the Load job have been met.

Oracle Implementations

Stored Procedure Definition for Stage Jobs—Oracle


PROCEDURE CMXCL.START_CLEANSE (
IN_DB_TYPE_STR VARCHAR2(200); --Database type (Oracle/DB2)
IN_HOST_NAME VARCHAR2(200); --Database Host Name
IN_SCHEMA_NAME VARCHAR2(200); --Schema Name
IN_PORT VARCHAR2(200); --Database Port
IN_CONNECT_PROPS VARCHAR2(200); --Connection Properties
IN_DB_USER_NAME VARCHAR2(200); --Database User Name
IN_DB_PASSWORD VARCHAR2(200); --Database User’s Password
IN_STG_ROWID_TABLE VARCHAR2(200); --Rowid of Staging Table
IN_ROWID_TABLE_OBJECT VARCHAR2(200); --Rowid of Table Object
IN_RUN_SYNCH VARCHAR2(200); --Run Synchronize, Boolean value (TRUE/FALSE)
OUT_ERROR_MSG VARCHAR2(2000); --Error Message, if any
OUT_ERROR_CODE NUMBER; --Error Code, if any
)

Sample Job Execution Script for Stage Jobs—Oracle


DECLARE
IN_DB_TYPE_STR VARCHAR2(200);
IN_HOST_NAME VARCHAR2(200);
IN_SCHEMA_NAME VARCHAR2(200);
IN_PORT VARCHAR2(200);
IN_CONNECT_PROPS VARCHAR2(200);
IN_DB_USER_NAME VARCHAR2(200);


IN_DB_PASSWORD VARCHAR2(200);
IN_STG_ROWID_TABLE VARCHAR2(200);
IN_ROWID_TABLE_OBJECT VARCHAR2(200);
IN_RUN_SYNCH VARCHAR2(200);
OUT_ERROR_MSG VARCHAR2(2000);
OUT_ERROR_CODE NUMBER;
BEGIN
IN_DB_TYPE_STR := NULL;
IN_HOST_NAME := 'dbhostmachine';
IN_SCHEMA_NAME := 'enterpriseschemaname';
IN_PORT := '7001';
IN_CONNECT_PROPS := NULL;
IN_DB_USER_NAME := 'admin';
IN_DB_PASSWORD := 'mydbpassword';
IN_STG_ROWID_TABLE := NULL;
IN_ROWID_TABLE_OBJECT := NULL;
IN_RUN_SYNCH := NULL;
OUT_ERROR_MSG := NULL;
OUT_ERROR_CODE := NULL;

SELECT a.rowid_table, a.rowid_table_object INTO IN_STG_ROWID_TABLE, IN_ROWID_TABLE_OBJECT
FROM c_repos_table_object_v a, c_repos_table b
WHERE a.object_name = 'CMX_CLEANSE.EXE'
AND b.rowid_table = a.rowid_table
AND b.table_name = 'C_HMO_ADDRESS'
AND a.valid_ind = 1;

CMXCL.START_CLEANSE ( IN_DB_TYPE_STR, IN_HOST_NAME, IN_SCHEMA_NAME,
IN_PORT, IN_CONNECT_PROPS, IN_DB_USER_NAME, IN_DB_PASSWORD,
IN_STG_ROWID_TABLE, IN_ROWID_TABLE_OBJECT, IN_RUN_SYNCH,
OUT_ERROR_MSG, OUT_ERROR_CODE );
dbms_output.put_line(' Message is = ' || out_error_msg);
COMMIT;
END;

Unmerge Jobs
For merge-style base objects only, the Unmerge job can unmerge already-consolidated
records, whether those records were consolidated using Automerge, Manual Merge,
manual edit, Load by Rowid_Object, or Put Xref. The Unmerge job succeeds or fails as
a single transaction: if the server fails while the Unmerge job is executing, the unmerge
process is rolled back.


Cascade Unmerge

The Unmerge job performs a cascade unmerge if this feature is enabled for this base
object in the Schema Manager in the Hub Console. With cascade unmerge, when
records in the parent object are unmerged, Siperian Hub also unmerges affected
records in the child base object. To learn more, see the Siperian Hub Administrator’s
Guide.

Unmerging All Records or One Record

In your job execution script, you can specify the scope of records to unmerge by setting IN_UNMERGE_ALL_XREFS_IND:
• IN_UNMERGE_ALL_XREFS_IND=0: Default setting. Unmerges the single record identified by the specified XREF to its state prior to the merge.
• IN_UNMERGE_ALL_XREFS_IND=1: Unmerges all XREFs to their state prior to the merge. Use this option to quickly unmerge all XREFs for a single consolidated record in a single operation.

Linear and Tree Unmerge

In your job execution script, you can specify the type of unmerge (linear or tree
unmerge) by setting IN_TREE_UNMERGE_IND:
• IN_TREE_UNMERGE_IND=0: Default setting. Linear Unmerge
• IN_TREE_UNMERGE_IND=1: Tree Unmerge

The rest of this section describes these two types of unmerges.

Linear Unmerge

Linear unmerge is the default behavior. During a linear unmerge, a base object record is unmerged and taken out of the existing merge tree structure. Only the unmerged base object record itself comes out of the merge tree structure; all base object records below it in the merge tree stay in the original merge tree.


Tree Unmerge

Tree unmerge is an optional alternative. A tree of merged base object records is a hierarchical structure of the merge history, reflecting the sequence of merge operations that have occurred. The merge history is kept during the merge process in the following tables:
• The HMXR table provides the current state view of merges.
• The HMRG table provides a hierarchical view of the merge history (a tree of merged base object records), as well as an interactive unmerge history.

During a tree unmerge, you unmerge a tree of merged base object records as an intact sub-structure. A sub-tree with the unmerged base object record as its root comes out of the original merge tree structure.
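The two behaviors can be sketched with a toy merge tree, represented here as a dict mapping each record to the records merged into it. This representation is an illustration only, not the actual HMXR/HMRG schema.

```python
# Toy merge tree sketch: A absorbed B, which had earlier absorbed C and D.
def linear_unmerge(tree, node, parent):
    """Linear unmerge: take only `node` out; the records below it stay
    in the original tree (reattached to node's parent)."""
    children = tree.pop(node, [])
    tree[parent].remove(node)
    tree[parent].extend(children)

def tree_unmerge(tree, node, parent):
    """Tree unmerge: take `node` out together with its intact sub-tree."""
    tree[parent].remove(node)
    subtree = {node: tree.pop(node, [])}
    stack = list(subtree[node])
    while stack:                       # move all descendants into the sub-tree
        n = stack.pop()
        subtree[n] = tree.pop(n, [])
        stack.extend(subtree[n])
    return subtree

t1 = {'A': ['B'], 'B': ['C', 'D'], 'C': [], 'D': []}
linear_unmerge(t1, 'B', 'A')          # only B leaves; C and D stay under A

t2 = {'A': ['B'], 'B': ['C', 'D'], 'C': [], 'D': []}
sub = tree_unmerge(t2, 'B', 'A')      # B leaves with C and D still attached
```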

Identifiers for Executing Manual Unmerge Jobs

To learn about the identifiers used to execute the stored procedure associated with this
batch job, see “Identifiers in C_REPOS_TABLE_OBJECT_V” on page 106.

Dependencies for Manual Unmerge Jobs

Each Manual Unmerge job is dependent on data having already been merged.

Successful Completion of Manual Unmerge Jobs

A Manual Unmerge job must complete with a RUN_STATUS of 0 (Completed Successfully) or 1 (Completed with Errors) to be considered successful.

Oracle Implementations

Stored Procedure Definition for Manual Unmerge Jobs—Oracle


PROCEDURE unmerge (
in_rowid_table cmxlb.cmx_rowid
,in_rowid_system cmxlb.cmx_rowid
,in_pkey_src_object cmxlb.cmx_pkey_src_object
,in_tree_unmerge_ind INT
,in_rowid_job_ctl cmxlb.cmx_rowid
,in_interaction_id INT
,in_user_name cmxlb.cmx_user_name
,out_unmerged_rowid OUT cmxlb.cmx_rowid
,out_tmp_table_list OUT cmxlb.cmx_big_str
,out_error_message OUT cmxlb.cmx_message
,rc OUT INT
,in_unmerge_all_xrefs_ind IN INT default 0
)

Sample Job Execution Script for Manual Unmerge Jobs—Oracle


DECLARE
in_rowid_table CHAR (14);
in_rowid_system CHAR (14);
in_pkey_src_object VARCHAR2 (255);
in_tree_unmerge_ind NUMBER;
in_rowid_job_ctl CHAR (14);
in_interaction_id NUMBER;
in_user_name VARCHAR2 (50);
out_unmerged_rowid CHAR (14);
out_tmp_table_list VARCHAR2 (32000);
out_error_message VARCHAR2 (1024);
rc NUMBER;
in_unmerge_all_xrefs_ind NUMBER;
BEGIN
in_rowid_table := 'SVR1.8ZC ';
in_rowid_system := 'SVR1.7NJ ';
in_pkey_src_object := '6';
in_tree_unmerge_ind := 0; -- default 0, 1 for tree unmerge
in_rowid_job_ctl := NULL;
in_interaction_id := NULL;
in_user_name := 'xhe';
out_unmerged_rowid := NULL;
out_tmp_table_list := NULL;
out_error_message := NULL;
rc := NULL;
in_unmerge_all_xrefs_ind := 0; -- default 0, 1 for unmerge_all
cmxmm.unmerge (
in_rowid_table,
in_rowid_system,
in_pkey_src_object,
in_tree_unmerge_ind,
in_rowid_job_ctl,
in_interaction_id,
in_user_name,
out_unmerged_rowid,
out_tmp_table_list,
out_error_message,
rc,
in_unmerge_all_xrefs_ind
);
DBMS_OUTPUT.put_line (' Return Code = ' || rc);
DBMS_OUTPUT.put_line (' Message is = ' || out_error_message);
END;

Scheduling Batch Groups


This section describes how to schedule batch groups for your Siperian Hub
implementation.

About Batch Groups


A batch group is a collection of individual batch jobs (for example, Stage, Load, and
Match jobs) that can be executed with a single command. Each batch job in a batch
group can be executed singly or in parallel with other jobs. To learn important
background information about batch groups, see the Siperian Hub Administrator’s Guide.

This section describes how to execute batch groups via stored procedures using job
scheduling software (such as Tivoli, CA Unicenter, and so on). Siperian Hub provides
stored procedures for managing batch groups, as described in “Stored Procedures for
Batch Groups” on page 138. Siperian Hub also allows you to create and run custom
stored procedures for batch groups, as described in “Developing Custom Stored
Procedures for Batch Jobs” on page 145.

You can also use the Batch Group tool in the Hub Console to configure and run batch
groups. However, to schedule batch groups, you need to do so via stored procedures,
as described in this section. To learn more about the Batch Group tool, see the Siperian
Hub Administrator’s Guide.


Stored Procedures for Batch Groups


Siperian Hub provides the following stored procedures for managing batch groups:

• cmxbg.execute_batchgroup: Performs an HTTP POST to the
ExecuteBatchGroupRequest operation via the Services Integration Framework (SIF).
To learn more, see “cmxbg.execute_batchgroup” on page 138.
• cmxbg.reset_batchgroup: Performs an HTTP POST to the ResetBatchGroupRequest
operation via SIF. To learn more, see “cmxbg.reset_batchgroup” on page 140.
• cmxbg.get_batchgroup_status: Performs an HTTP POST to the
GetBatchGroupStatusRequest operation via SIF. To learn more, see
“cmxbg.get_batchgroup_status” on page 142.

In addition to using parameters that are associated with the corresponding SIF
operation, these stored procedures require the following parameters:
• URL of the Hub Server (for example, http://localhost:7001/cmx/request)
• username and password
• target ORS

These stored procedures construct an XML message, perform an HTTP POST to a
server URL via SIF, and return the results.

cmxbg.execute_batchgroup

Performs an HTTP POST to the ExecuteBatchGroupRequest operation via SIF.

Note: This stored procedure has an option to execute asynchronously, but not to
receive a JMS response for asynchronous execution. If you need to use asynchronous
execution and need to know when execution is finished, then poll with the
cmxbg.get_batchgroup_status stored procedure. Alternatively, if you need to
receive a JMS response for asynchronous execution, then execute the batch group
directly in an external application (instead of a job execution script) by invoking the
Siperian Hub ExecuteBatchGroupRequest operation, which is described in the
Siperian Services Integration Framework Guide.
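The polling pattern mentioned in the note can be sketched as follows. Everything here is
hypothetical scaffolding: BatchGroupStatusClient stands in for whatever wrapper you
write around cmxbg.get_batchgroup_status, and the "RUNNING"/"COMPLETED" strings are
placeholders for the real out_run_status codes documented in the Siperian Hub
Administrator's Guide.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a wrapper around cmxbg.get_batchgroup_status.
interface BatchGroupStatusClient {
    String getRunStatus(String rowidBatchGroupLog) throws Exception;
}

public class BatchGroupPoller {
    // Poll until the group is no longer running, waiting pollSeconds
    // between calls, for at most maxAttempts attempts.
    static String waitForCompletion(BatchGroupStatusClient client,
            String logRowid, long pollSeconds, int maxAttempts) throws Exception {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            String status = client.getRunStatus(logRowid);
            if (!"RUNNING".equals(status)) {
                return status; // finished, successfully or with errors
            }
            TimeUnit.SECONDS.sleep(pollSeconds);
        }
        throw new IllegalStateException("batch group still running after "
                + maxAttempts + " polls");
    }

    public static void main(String[] args) throws Exception {
        // Stub client that reports RUNNING twice before completing.
        int[] calls = {0};
        BatchGroupStatusClient stub =
                rowid -> calls[0]++ < 2 ? "RUNNING" : "COMPLETED";
        System.out.println(waitForCompletion(stub, "stub-rowid", 0, 10));
    }
}
```

The same loop could equally be written in PL/SQL inside a job execution script; the
point is only that asynchronous execution requires some caller-side polling loop with a
bounded number of attempts.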

Signature
FUNCTION execute_batchgroup(
in_mrm_server_url IN cmxlb.cmx_small_str
, in_username IN cmxlb.cmx_small_str
, in_password IN cmxlb.cmx_small_str
, in_orsid IN cmxlb.cmx_small_str
, in_rowid_batchgroup IN cmxlb.cmx_small_str
, in_resume IN cmxlb.cmx_small_str
, in_asyncronous IN cmxlb.cmx_small_str
, out_rowid_batchgroup_log OUT cmxlb.cmx_small_str
, out_error_msg OUT cmxlb.cmx_small_str
) RETURN NUMBER -- return the error code

Parameters

• in_mrm_server_url: Hub Server SIF URL.
• in_username: User account with role-based permissions to execute batch groups.
• in_password: Password for the user account with role-based permissions to execute
batch groups.
• in_orsid: ORS ID as specified in the Database tool in the Hub Console.
To learn more, see the Siperian Hub Administrator’s Guide.
• in_rowid_batchgroup: c_repos_job_group.rowid_job_group
• in_resume: One of the following values:
  • true: if the previous execution failed, resume at that point
  • false: regardless of any previous execution, start from the beginning
• in_asyncronous: Specifies whether to execute asynchronously or synchronously. One
of the following values:
  • true: start execution and return immediately (asynchronous execution)
  • false: return when group execution is complete (synchronous execution)


Returns

• out_rowid_batchgroup_log: c_repos_job_group_control.rowid_job_group_control
• out_error_msg: Error message text.
• NUMBER (return value): Error code. If zero (0), the stored procedure completed
successfully. If one (1), the stored procedure returns an explanation in
out_error_msg.

Example
DECLARE
out_rowid_batchgroup_log cmxlb.cmx_small_str;
out_error_msg cmxlb.cmx_small_str;
ret_val int;
BEGIN
ret_val := cmxbg.execute_batchgroup(
'http://localhost:7001/cmx/request/process/'
, 'admin'
, 'admin'
,'localhost-mrm-XU_3009'
, 'SVR1.1VHDH'
, 'true' -- or 'false'
, 'true' -- or 'false'
, out_rowid_batchgroup_log
, out_error_msg
);
cmxlb.debug_print('execute_batchgroup: ' || ' code='|| ret_val ||
' message='|| out_error_msg || ' | out_rowid_batchgroup_log='|| out_rowid_batchgroup_log);
END;
/

cmxbg.reset_batchgroup

Performs an HTTP POST to the ResetBatchGroupRequest operation via SIF.

Signature
FUNCTION reset_batchgroup(
in_mrm_server_url IN cmxlb.cmx_small_str
, in_username IN cmxlb.cmx_small_str
, in_password IN cmxlb.cmx_small_str
, in_orsid IN cmxlb.cmx_small_str
, in_rowid_batchgroup IN cmxlb.cmx_small_str
, out_rowid_batchgroup_log OUT cmxlb.cmx_small_str
, out_error_msg OUT cmxlb.cmx_small_str
) RETURN NUMBER -- return the error code

Parameters

• in_mrm_server_url: Hub Server SIF URL.
• in_username: User account with role-based permissions to execute batch groups.
• in_password: Password for the user account with role-based permissions to execute
batch groups.
• in_orsid: ORS ID as specified in the Database tool in the Hub Console.
To learn more, see the Siperian Hub Administrator’s Guide.
• in_rowid_batchgroup: c_repos_job_group.rowid_job_group

Returns

• out_rowid_batchgroup_log: c_repos_job_group_control.rowid_job_group_control
• out_error_msg: Error message text.
• NUMBER (return value): Error code. If zero (0), the stored procedure completed
successfully. If one (1), the stored procedure returns an explanation in
out_error_msg.

Example
DECLARE
out_rowid_batchgroup_log cmxlb.cmx_small_str;
out_error_msg cmxlb.cmx_small_str;
ret_val int;
BEGIN
ret_val := cmxbg.reset_batchgroup(
'http://localhost:7001/cmx/request/process/'
, 'admin'
, 'admin'
,'localhost-mrm-XU_3009'
, 'SVR1.1VHDH'
, out_rowid_batchgroup_log
, out_error_msg
);
cmxlb.debug_print('reset_batchgroup: ' || ' code='|| ret_val ||
' message='|| out_error_msg || ' | out_rowid_batchgroup_log='|| out_rowid_batchgroup_log);
END;
/

cmxbg.get_batchgroup_status

Performs an HTTP POST to the GetBatchGroupStatusRequest operation via SIF.

Signature
FUNCTION get_batchgroup_status(
in_mrm_server_url IN cmxlb.cmx_small_str
, in_username IN cmxlb.cmx_small_str
, in_password IN cmxlb.cmx_small_str
, in_orsid IN cmxlb.cmx_small_str
, in_rowid_batchgroup IN cmxlb.cmx_small_str
, in_rowid_batchgroup_log IN cmxlb.cmx_small_str
, out_rowid_batchgroup OUT cmxlb.cmx_small_str
, out_rowid_batchgroup_log OUT cmxlb.cmx_small_str
, out_start_rundate OUT cmxlb.cmx_small_str
, out_end_rundate OUT cmxlb.cmx_small_str
, out_run_status OUT cmxlb.cmx_small_str
, out_status_message OUT cmxlb.cmx_small_str
, out_error_msg OUT cmxlb.cmx_small_str
) RETURN NUMBER -- return the error code

Parameters

• in_mrm_server_url: Hub Server SIF URL.
• in_username: User account with role-based permissions to execute batch groups.
• in_password: Password for the user account with role-based permissions to execute
batch groups.
• in_orsid: ORS ID as specified in the Database tool in the Hub Console.
To learn more, see the Siperian Hub Administrator’s Guide.
• in_rowid_batchgroup: c_repos_job_group.rowid_job_group. If
in_rowid_batchgroup_log is null, the most recent log for this group will be used.
• in_rowid_batchgroup_log: c_repos_job_group_control.rowid_job_group_control.
Either in_rowid_batchgroup or in_rowid_batchgroup_log is required.

Returns

• out_rowid_batchgroup: c_repos_job_group.rowid_job_group
• out_rowid_batchgroup_log: c_repos_job_group_control.rowid_job_group_control
• out_start_rundate: Date/time when this batch job started.
• out_end_rundate: Date/time when this batch job ended.
• out_run_status: Job execution status code that is displayed in the Batch Group tool.
To learn more, see the Siperian Hub Administrator’s Guide.
• out_status_message: Job execution status message that is displayed in the Batch
Group tool. To learn more, see the Siperian Hub Administrator’s Guide.
• out_error_msg: Error message text for this stored procedure call, if applicable.
• NUMBER (return value): Error code. If zero (0), the stored procedure completed
successfully. If one (1), the stored procedure returns an explanation in
out_error_msg.

Example
DECLARE
out_rowid_batchgroup cmxlb.cmx_small_str;
out_rowid_batchgroup_log cmxlb.cmx_small_str;
out_start_rundate cmxlb.cmx_small_str;
out_end_rundate cmxlb.cmx_small_str;
out_run_status cmxlb.cmx_small_str;
out_status_message cmxlb.cmx_small_str;
out_error_msg cmxlb.cmx_small_str;
out_returncode int;
ret_val int;
BEGIN
ret_val := cmxbg.get_batchgroup_status(
'http://localhost:7001/cmx/request/process/'
, 'admin'
, 'admin'
,'localhost-mrm-XU_3009'
, 'SVR1.1VHDH'
, null
, out_rowid_batchgroup
, out_rowid_batchgroup_log
, out_start_rundate
, out_end_rundate
, out_run_status
, out_status_message
, out_error_msg
);
cmxlb.debug_print('get_batchgroup_status: ' || ' code='|| ret_val ||
' message='|| out_error_msg || ' | status=' || out_status_message ||
' | out_rowid_batchgroup_log='|| out_rowid_batchgroup_log);
END;
/

Developing Custom Stored Procedures for Batch Jobs

This section describes how to create and register custom stored procedures for batch
jobs that can be added to batch groups for your Siperian Hub implementation.

About Custom Stored Procedures


Siperian Hub also allows you to create and run custom stored procedures for batch
groups. After developing the custom stored procedure, you must register it in order to
make it available to users as batch jobs in the Batch Viewer and Batch Group tools in
the Hub Console. To learn more about these tools, see the Siperian Hub Administrator’s
Guide.

Required Execution Parameters for Custom Batch Jobs


The following parameters are required for custom batch jobs. During its execution, a
custom batch job can call the MRM procedure cmxut.set_metric_value to register
metrics.

Signature
PROCEDURE example_job(
in_rowid_table_object IN cmxlb.cmx_rowid -- c_repos_table_object.rowid_table_object, result of cmxut.REGISTER_CUSTOM_TABLE_OBJECT
,in_user_name IN cmxlb.cmx_user_name -- username calling the function
,in_rowid_job IN cmxlb.cmx_rowid -- c_repos_job_control.rowid_job, for reference; do not update status
,out_err_msg OUT varchar -- message about success or error
,out_err_code OUT int -- >=0: completed successfully; <0: error
);


Parameters
• in_rowid_table_object (IN cmxlb.cmx_rowid):
c_repos_table_object.rowid_table_object; the result of
cmxut.REGISTER_CUSTOM_TABLE_OBJECT.
• in_user_name (IN cmxlb.cmx_user_name): User name calling the function.

Returns
Parameter Description
out_err_msg Error message text.
out_err_code Error code.

Example Custom Stored Procedure


CREATE or replace PACKAGE cmxbg_example
AS

PROCEDURE update_table(
in_rowid_table_object IN cmxlb.cmx_rowid -- c_repos_table_object.rowid_table_object, result of cmxut.REGISTER_CUSTOM_TABLE_OBJECT
,in_user_name IN cmxlb.cmx_user_name -- username calling the function
,in_rowid_job IN cmxlb.cmx_rowid -- c_repos_job_control.rowid_job, for reference; do not update status
,out_err_msg OUT varchar -- message about success or error
,out_err_code OUT int -- >=0: completed successfully; <0: error
);

end cmxbg_example;
/


CREATE OR REPLACE PACKAGE BODY cmxbg_example
AS
PROCEDURE update_table(
in_rowid_table_object IN cmxlb.cmx_rowid -- c_repos_table_object.rowid_table_object, result of cmxut.REGISTER_CUSTOM_TABLE_OBJECT
,in_user_name IN cmxlb.cmx_user_name -- username calling the function
,in_rowid_job IN cmxlb.cmx_rowid -- c_repos_job_control.rowid_job, for reference; do not update status
,out_err_msg OUT varchar -- message about success or error
,out_err_code OUT int -- >=0: completed successfully; <0: error
)
AS
BEGIN
DECLARE
cutoff_date DATE;
record_count INT;
run_status INT;
status_message VARCHAR2 (2000);
start_date DATE := SYSDATE;
mrm_rowid_table cmxlb.cmx_rowid;
obj_func_type CHAR (1);
job_id CHAR (14);
sql_stmt VARCHAR2 (2000);
table_name VARCHAR2(30);
ret_code INT;
register_job_err EXCEPTION;
BEGIN
sql_stmt := 'alter session set nls_date_format=''dd mon yyyy hh24:mi:ss''';

EXECUTE IMMEDIATE sql_stmt;

cmxut.debug_print ('Start of custom batch job...');


obj_func_type := 'A';

SELECT rowid_table
INTO mrm_rowid_table
FROM c_repos_table_object


WHERE rowid_table_object = in_rowid_table_object;

SELECT start_run_date
INTO cutoff_date
FROM c_repos_job_control
WHERE rowid_job = in_rowid_job;

IF cutoff_date IS NULL THEN
cutoff_date := SYSDATE - 7;
END IF;

-- procedure can be registered on different tables, so get the table_name
SELECT table_name
INTO table_name
FROM c_repos_table rt, c_repos_table_object rto
WHERE rto.ROWID_TABLE_OBJECT = in_rowid_table_object
AND rto.ROWID_TABLE = rt.ROWID_TABLE;

-- The real work!
sql_stmt :=
'update ' || table_name
|| ' set zip4 = ''0000'', last_update_date = '''
|| cutoff_date
|| ''''
|| ' where zip4 is null';
cmxut.debug_print (sql_stmt);
EXECUTE IMMEDIATE sql_stmt;
record_count := SQL%ROWCOUNT;
COMMIT;

-- for testing, sleep to make the procedure take longer
-- dbms_lock.sleep(5);

-- Set zero or many metrics about the job
cmxut.set_metric_value (in_rowid_job,
1, -- c_repos_job_metric_type.metric_type_code
record_count,
out_err_code,
out_err_msg
);
COMMIT;

IF record_count <= 0 THEN
out_err_msg := 'Failed to update records.';
out_err_code := -1;
ELSE
IF out_err_code >= 0 THEN
out_err_msg := 'Completed successfully.';
END IF;
-- else keep success code and msg from set_metric_value
END IF;

EXCEPTION
WHEN OTHERS
THEN
out_err_code := SQLCODE;
out_err_msg := SUBSTR (SQLERRM, 1, 200);
END;
END;
END cmxbg_example;
/

Registering a Custom Stored Procedure


You must register a custom stored procedure with Siperian Hub in order to make it
available to users in the Batch Group tool in the Hub Console. To register a custom
stored procedure, call the following procedure, which registers it in
c_repos_table_object:
cmxut.REGISTER_CUSTOM_TABLE_OBJECT

The same custom job can be registered multiple times for different tables
(in_rowid_table).

Signature
PROCEDURE register_custom_table_object(
in_rowid_table cmxlb.cmx_rowid
, in_obj_func_type_code VARCHAR
, in_obj_func_type_desc VARCHAR
, in_object_name VARCHAR );


Parameters
• in_rowid_table (cmxlb.cmx_rowid): Foreign key to c_repos_table.rowid_table.
When the Hub Server calls the custom job in a batch group, this value is passed in.
• in_obj_func_type_code: Job type code. Must be 'A' for batch group custom jobs.
• in_obj_func_type_desc: Display name for the custom batch job in the Batch Groups
tool in the Hub Console.
• in_object_name: package.procedure name of the custom job.

Example
BEGIN
cmxut.REGISTER_CUSTOM_TABLE_OBJECT (
'SVR1.RS1B ' -- c_repos_table.rowid_table
,'A' -- job type, must be 'A' for batch group
,'cmxbg_example.update_table example' -- display name
,'cmxbg_example.update_table' -- package.procedure
);
END;

8
Implementing Custom Buttons in Hub Console Tools

This chapter explains how, in your Siperian Hub implementation, you can add custom
buttons to tools in the Hub Console that allow users to invoke external services on
demand.

Chapter Contents
• About Custom Buttons in the Hub Console
• Adding Custom Buttons

About Custom Buttons in the Hub Console


In your Siperian Hub implementation, you can provide Hub Console users with
custom buttons that can be used to extend your Siperian Hub implementation. Custom
buttons can provide users with on-demand, real-time access to specialized data
services. Custom buttons can be added to any of the following tools in the Hub
Console: Merge Manager, Data Manager, and Hierarchy Manager.

Custom buttons can give users the ability to invoke a particular external service (such
as retrieving data or computing results), perform a specialized operation (such as
launching a workflow), and other tasks. Custom buttons can be designed to access data
services by a wide range of service providers, including—but not limited
to—enterprise applications (such as CRM or ERP applications), external service
providers (such as foreign exchange calculators, publishers of financial market indexes,
or government agencies), and even Siperian Hub itself (see Siperian Services Integration
Framework Guide).

For example, you could add a custom button that invokes a specialized cleanse
function, offered as a Web service by a vendor, that cleanses data in the customer
record that is currently selected in the Data Manager screen. When the user clicks the
button, the underlying code would capture the relevant data from the selected record,
create a request (possibly including authentication information) in the format expected
by the Web service, and then submit that request to the Web service for processing.
When the results are returned, the Data Manager displays the information in a separate
Swing dialog (if you created one and if you implemented this as a client custom
function) with the customer rowid_object from Siperian Hub.

Custom buttons are not installed by default, nor are they required for every Siperian
Hub implementation. For each custom button that you want to add, you need to
implement a Java interface, package the implementation in a JAR file, and deploy it by
running a command-line utility. To control the appearance of the custom button in the
Hub Console, you can supply either text or an icon graphic in any Swing-compatible
graphic format (such as JPG, PNG, or GIF).

How Custom Buttons Appear in the Hub Console


This section shows how custom buttons, once implemented, will appear in the Merge
Manager and Data Manager tools of the Hub Console.


Custom Buttons in the Merge Manager

Custom buttons are displayed in the top panel of the Merge Manager screen, as shown
in the following example.


Custom Buttons in the Data Manager

Custom buttons are displayed in the top panel of the Data Manager screen, as shown
in the following example.

What Happens When a User Clicks a Custom Button


When a user clicks a custom button in the Hub Console, the Hub Console invokes the
request, passing content and context to the external service. Examples include record
keys and other data from a base object, package information, and so on. Execution is
asynchronous—the user can continue to work in the Hub Console while the request is
processed.

The custom code can process the service response as appropriate—log the results,
display the data to the user in a separate Swing dialog (if custom-coded and the custom
function is client-side), allow users to copy and paste the results into a data entry field,
execute real-time PUTs of the data back into the correct business objects, and so on.


Adding Custom Buttons


To add a custom button to the Hub Console in your Siperian Hub implementation,
complete the following tasks:
1. Determine the details of the external service that you want to invoke, such as the
format and parameters for request and response messages.
2. Write and package the business logic that the custom button will execute, as
described in “Writing a Custom Function” on page 155.
3. Deploy the package so that it appears in the applicable tool(s) in the Hub Console,
as described in “Deploying Custom Buttons” on page 159.

Once an external service button is visible in the Hub Console, users can click the
button to invoke the service.

Writing a Custom Function


To build an external service invocation, you write a custom function that executes the
application logic when a user clicks the custom button in the Hub Console.
The application logic implements the following Java interface:

com.siperian.mrm.customfunctions.api.CustomFunction

To learn more about this interface, see the Javadoc that accompanies your Siperian
Hub distribution.

Server-Based and Client-Based Custom Functions

Execution of the application logic occurs on either:

• Client: UI-based custom function. Recommended when you want to display elements
in the user interface, such as a separate dialog that displays response information.
To learn more, see “Example Client-Based Custom Function” on page 156.
• Server: Server-based custom function. Recommended when it is preferable to call
the external service from the server for network or performance reasons. To learn
more, see “Example Server-Based Function” on page 157.

Example Custom Functions

This section provides the Java code for two example custom functions that implement
the com.siperian.mrm.customfunctions.api.CustomFunction interface. The code simply
prints (on standard error) information to the server log or the Hub Console log.

Example Client-Based Custom Function

The name of the client function class for the following sample code is
com.siperian.mrm.customfunctions.test.TestFunctionClient.
//=====================================================================
//project: Siperian Master Reference Manager, Hierarchy Manager
//---------------------------------------------------------------------
//copyright: Siperian Inc. (c) 2003-2006. All rights reserved.
//=====================================================================

package com.siperian.mrm.customfunctions.test;

import java.awt.Frame;
import java.util.Properties;

import javax.swing.Icon;

import com.siperian.mrm.customfunctions.api.CustomFunction;

public class TestFunctionClient implements CustomFunction {

    public void executeClient(Properties properties, Frame frame,
            String username, String password, String orsId,
            String baseObjectRowid, String baseObjectUid,
            String packageRowid, String packageUid, String[] recordIds) {
        System.err.println("Called custom test function on the client with the following parameters:");
        System.err.println("Username/Password: '" + username + "'/'" + password + "'");
        System.err.println("  ORS Database ID: '" + orsId + "'");
        System.err.println("Base Object Rowid: '" + baseObjectRowid + "'");
        System.err.println("  Base Object UID: '" + baseObjectUid + "'");
        System.err.println("    Package Rowid: '" + packageRowid + "'");
        System.err.println("      Package UID: '" + packageUid + "'");
        System.err.println("       Record Ids: ");
        for (int i = 0; i < recordIds.length; i++) {
            System.err.println("    '" + recordIds[i] + "'");
        }
        System.err.println("       Properties: " + properties.toString());
    }

    public void executeServer(Properties properties, String username,
            String password, String orsId, String baseObjectRowid,
            String baseObjectUid, String packageRowid, String packageUid,
            String[] recordIds) {
        System.err.println("This method will never be called because getExecutionType() returns CLIENT_FUNCTION");
    }

    public String getActionText() { return "Test Client"; }

    public int getExecutionType() { return CLIENT_FUNCTION; }

    public Icon getGuiIcon() { return null; }
}

Example Server-Based Function

The name of the server function class for the following code is
com.siperian.mrm.customfunctions.test.TestFunction.
//=====================================================================
//project: Siperian Master Reference Manager, Hierarchy Manager
//---------------------------------------------------------------------
//copyright: Siperian Inc. (c) 2003-2006. All rights reserved.
//=====================================================================

package com.siperian.mrm.customfunctions.test;

import java.awt.Frame;
import java.util.Properties;

import javax.swing.Icon;


import com.siperian.mrm.customfunctions.api.CustomFunction;

/**
* This is a sample custom function that is executed on the Server.
* To deploy this function, put it in a jar file and upload the jar file
* to the DB using DeployCustomFunction.
*/
public class TestFunction implements CustomFunction {
    public String getActionText() {
        return "Test Server";
    }

    public Icon getGuiIcon() {
        return null;
    }

    public void executeClient(Properties properties, Frame frame,
            String username, String password, String orsId,
            String baseObjectRowid, String baseObjectUid,
            String packageRowid, String packageUid, String[] recordIds) {
        System.err.println("This method will never be called because getExecutionType() returns SERVER_FUNCTION");
    }

    public void executeServer(Properties properties, String username,
            String password, String orsId, String baseObjectRowid,
            String baseObjectUid, String packageRowid, String packageUid,
            String[] recordIds) {
        System.err.println("Called custom test function on the server with the following parameters:");
        System.err.println("Username/Password: '" + username + "'/'" + password + "'");
        System.err.println("  ORS Database ID: '" + orsId + "'");
        System.err.println("Base Object Rowid: '" + baseObjectRowid + "'");
        System.err.println("  Base Object UID: '" + baseObjectUid + "'");
        System.err.println("    Package Rowid: '" + packageRowid + "'");
        System.err.println("      Package UID: '" + packageUid + "'");
        System.err.println("       Record Ids: ");
        for (int i = 0; i < recordIds.length; i++) {
            System.err.println("    '" + recordIds[i] + "'");
        }
        System.err.println("       Properties: " + properties.toString());
    }

    public int getExecutionType() {
        return SERVER_FUNCTION;
    }
}


Controlling the Custom Button Appearance


To control the appearance of the custom button in the Hub Console, you implement
one of the following methods in the
com.siperian.mrm.customfunctions.api.CustomFunction interface:

Method Description
getActionText Specify the text for the button label. Uses the default visual appearance for
custom buttons.
getGuiIcon Specify the icon graphic in any Swing-compatible graphic format (such as
JPG, PNG, or GIF). This image file can be bundled with the JAR for this
custom function.

Custom buttons are displayed alphabetically by name in the Hub Console.
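For getGuiIcon, a common approach is to load the image as a classpath resource from the
JAR that contains the custom function. The helper below is a hypothetical sketch: the
class name and resource path are illustrative, and returning null falls back to the text
label supplied by getActionText (as the sample functions above do).

```java
import java.net.URL;
import javax.swing.Icon;
import javax.swing.ImageIcon;

// Hypothetical helper for loading a custom button icon bundled in the
// same JAR as the custom function class.
public class ButtonIconLoader {
    // Return an icon loaded from the classpath, or null if the resource
    // is missing (so the button shows its text label instead).
    public static Icon loadGuiIcon(Class<?> owner, String resourcePath) {
        URL url = owner.getResource(resourcePath);
        return (url == null) ? null : new ImageIcon(url);
    }

    public static void main(String[] args) {
        // No such resource is bundled here, so this prints null.
        System.out.println(loadGuiIcon(ButtonIconLoader.class, "/icons/cleanse_button.png"));
    }
}
```

A custom function's getGuiIcon implementation could then simply delegate to this
helper with its own class and a path such as "/icons/cleanse_button.png" (an
illustrative name), keeping the image file inside the deployed JAR.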

Deploying Custom Buttons


Before users can see the custom buttons in the Hub Console, you need to explicitly
add them using the DeployCustomFunction utility from the command line.

To deploy custom buttons:


1. Get to a command prompt.

2. Run the DeployCustomFunction utility at the command prompt.
3. When prompted, specify the database type.
4. When prompted, specify database connection information:
• Oracle: database host, port, service, login username, and password
• DB2: host, port, database, schema, username, and password.


5. The DeployCustomFunction tool displays a menu of the following options.

Label Description
(L)ist Displays a list of currently-defined custom buttons.
(A)dd Adds a new custom button. The DeployCustomFunction tool prompts
you to specify:
• the JAR file for your custom button
• the name of the custom function class that implements the
com.siperian.mrm.customfunctions.api.CustomFunction interface
• the type of the custom button: d—Data Manager, m—Merge
Manager, and /or h—Hierarchy Manager (you can specify one,
two, or three letters)
(U)pdate Updates the JAR file for an existing custom button.
The DeployCustomFunction tool prompts you to specify:
• the rowID of the custom button to update
• the JAR file for your custom button
• the name of the custom function class that implements the
com.siperian.mrm.customfunctions.api.CustomFunction interface
• the type of the custom button: d—Data Manager, m—Merge
Manager, and /or h—Hierarchy Manager (you can specify one,
two, or three letters)
(C)hange Type Changes the type of an existing custom button. The
DeployCustomFunction tool prompts you to specify:
• the rowID of the custom button to update
• the type of the custom button: d—Data Manager, m—Merge
Manager, and /or h—Hierarchy Manager (you can specify one,
two, or three letters)
(S)et Properties Specify a properties file, which defines name/value pairs that the
custom function requires at execution time (name=value).
The DeployCustomFunction tool prompts you to specify the
properties file to use.
(D)elete Deletes an existing custom button. The DeployCustomFunction tool
prompts you to specify the rowID of the custom button to delete.
(Q)uit Exits the DeployCustomFunction tool.

6. When you have finished choosing your actions, choose (Q)uit.

7. Refresh the browser window to display the custom button you just added.
8. Test your custom button to ensure that it works properly.
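The properties file consumed by the (S)et Properties option in step 5 is a plain name=value file in the standard Java properties format. The sketch below shows how custom function code might parse such a file with java.util.Properties; the property names (endpoint.url, timeout.seconds) are purely hypothetical examples, not keys that Siperian Hub defines.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

public class CustomButtonProps {
    // Parse name=value pairs in the standard Java properties format, the
    // same format the (S)et Properties option expects in its file.
    public static Properties load(String contents) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(contents));
        } catch (IOException e) { // cannot happen for an in-memory reader
            throw new IllegalStateException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        // Hypothetical property names; use whatever names your custom
        // function actually reads at execution time.
        Properties props = load("endpoint.url=http://example.com/service\n"
                              + "timeout.seconds=30\n");
        System.out.println(props.getProperty("endpoint.url"));
        System.out.println(props.getProperty("timeout.seconds"));
    }
}
```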



Glossary

accept limit

A number that determines the acceptability of a match. The accept limit is defined by
Siperian within a population in accordance with its match purpose.

Admin source system

Default source system. Used for manual trust overrides and data edits from the Data
Manager or Merge Manager tools. See source system.

administrator

Siperian Hub user who has the primary responsibility for configuring the Siperian Hub
system. Administrators access Siperian Hub through the Hub Console, and use
Siperian Hub tools to configure the objects in the Hub Store, and create and modify
Siperian Hub security.

authentication

Process of verifying the identity of a user to ensure that they are who they claim to be.
In Siperian Hub, users are authenticated based on their supplied credentials—user
name / password, security payload, or a combination of both. Siperian Hub provides
an internal authentication mechanism and also supports user authentication via
third-party authentication providers. See credentials, security payload.

authorization

Process of determining whether a user has sufficient privileges to access a requested


Siperian Hub resource. In Siperian Hub, resource privileges are allocated to roles. Users
and user groups are assigned to roles. A user’s resource privileges are determined by
the roles to which they are assigned, as well as by the roles assigned to the user
group(s) to which the user belongs. See user, user group, role, resource, and privilege.

autolink

Process of linking records automatically. For link-style base objects only. Match rules
can result in automatic linking or manual linking. A match rule that instructs Siperian
Hub to perform an autolink will link two or more records of a base object table
automatically, without manual intervention. See manual link, link-style base object.

automerge

Process of merging records automatically. For merge-style base objects only. Match
rules can result in automatic merging or manual merging. A match rule that instructs
Siperian Hub to perform an automerge will combine two or more records of a base
object table automatically, without manual intervention. See manual merge, merge-style
base object.

base object

A table that contains information about an entity that is relevant to your business, such
as customer or account.

batch group

A collection of individual batch jobs (for example, Stage, Load, and Match jobs) that
can be executed with a single command. Each batch job in a group can be executed
sequentially or in parallel to other jobs. See also batch job.



batch job

A program that, when executed, completes a discrete unit of work (a process).


For example, the Match job carries out the match process, checking the specified
match condition for the records of a base object table and then queueing the matched
records for either automerge (Automerge job) or manual merge (Manual Merge job).
See also batch group.

batch mode

Way of interacting with Siperian Hub via batch jobs, which can be executed in the Hub
Console or using third-party management tools to schedule and execute batch jobs (in
the form of stored procedures) on the database server. See also real-time mode, batch
job, batch group, stored procedure.

best version of the truth

A record that has been consolidated with the best cells of data from the source records.
Sometimes abbreviated as BVT. The precise definition depends on the base object
style:
• For merge-style base objects, the base object record is the BVT record, and is built
by consolidating with the most-trustworthy cell values from the corresponding
source records.
• For link-style base objects, the BVT Snapshot job will build the BVT record(s) by
consolidating with the most-trustworthy cell values from the corresponding linked
base object records and return to the requestor a snapshot for consumption.

bulk merge

See automerge.

bulk unmerge

See unmerge.

BVT

See best version of the truth.

cascade unmerge

During the unmerge process, if this feature is enabled, when records in the parent
object are unmerged, Siperian Hub also unmerges affected records in the child base
object. See unmerge.

cell

Intersection of a column and a record in a table. A cell contains a data value or null.

cleanse

See data cleansing.

cleanse engine

A third-party product used with Siperian Hub to perform data cleansing.

cleanse function

Code that changes the incoming data during Stage jobs, converting each input string to an
output string. Typically, these functions are used to standardize data and thereby
optimize the match process. By combining multiple cleanse functions, you can perform
complex filtering and standardization.
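The combination of cleanse functions can be made concrete with a small sketch. This is not the Siperian cleanse API; it only illustrates, using ordinary java.util.function composition, how string-to-string functions can be chained into a standardization pipeline of the kind described above.

```java
import java.util.function.Function;

public class CleanseSketch {
    // Each cleanse step converts an input string to an output string.
    static final Function<String, String> TRIM = String::trim;
    static final Function<String, String> UPPER = s -> s.toUpperCase();
    // Collapse runs of whitespace into a single space.
    static final Function<String, String> SQUEEZE = s -> s.replaceAll("\\s+", " ");

    // Combining multiple cleanse functions into one pipeline, analogous to
    // chaining cleanse functions to standardize data before matching.
    static final Function<String, String> STANDARDIZE =
        TRIM.andThen(SQUEEZE).andThen(UPPER);

    public static String standardize(String in) {
        return STANDARDIZE.apply(in);
    }

    public static void main(String[] args) {
        System.out.println(standardize("  acme   Corp  ")); // ACME CORP
    }
}
```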

cleanse list

A logical grouping of cleanse functions that are executed at run time in a predefined
order. See cleanse function, data cleansing.



column

In a table, a set of data values of a particular type, one for each row of the table.

conditional mapping

A mapping between a column in a landing table and a staging table that uses a SQL
WHERE clause to conditionally select only those records in the landing table that meet
the filter condition. See mapping, distinct mapping.

consolidation process

Process of merging or linking duplicate records into a single record. The goal in
Siperian Hub is to identify and eliminate all duplicate data and to merge or link them
together into a single, consolidated record while maintaining full traceability.

consolidation indicator

Represents the state of a record in a base object. Stored in the CONSOLIDATION_IND column. The consolidation indicator is one of the following values:

1 (Consolidated): Indicates that the record has been determined to be unique.
2 (Queued for Merge): Indicates that the record has gone through the match process.
3 (Queued for Match): Indicates that the record is ready to be put through the match process against the rest of the records in the base object.
4 (New): Indicates that the record has been newly loaded into the base object and has not gone through the match process.
9 (On hold): Indicates that the Data Steward has put the record on hold, to deal with later.
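The indicator values can be mirrored in application code. The enum below is purely illustrative (it is not part of the Siperian Hub API); it simply captures the CONSOLIDATION_IND values and their meanings as listed above.

```java
public enum ConsolidationIndicator {
    CONSOLIDATED(1),      // determined to be unique
    QUEUED_FOR_MERGE(2),  // has gone through the match process
    QUEUED_FOR_MATCH(3),  // ready to be put through the match process
    NEW(4),               // newly loaded, not yet matched
    ON_HOLD(9);           // put on hold by the Data Steward

    private final int value;

    ConsolidationIndicator(int value) { this.value = value; }

    public int value() { return value; }

    // Map a CONSOLIDATION_IND column value back to its meaning.
    public static ConsolidationIndicator fromValue(int v) {
        for (ConsolidationIndicator c : values()) {
            if (c.value == v) return c;
        }
        throw new IllegalArgumentException("Unknown indicator: " + v);
    }

    public static void main(String[] args) {
        System.out.println(fromValue(2)); // QUEUED_FOR_MERGE
    }
}
```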

control table

A type of system table in an ORS that Siperian Hub automatically creates for a base
object. Control tables are used in support of the stage and load processes. For each
trust-enabled column in a base object, Siperian Hub maintains a record (the last update
date and an identifier of the source system) in a corresponding control table

credentials

What a user supplies at login time to gain access to Siperian Hub resources. Credentials
are used during the authentication process to determine whether a user is who they
claim to be. Login credentials might be a user name and password, a security payload
(such as a security token or some other binary data), or a combination of user
name/password and security payload. See authentication, security payload.

cross-reference table

A type of system table in an ORS that Siperian Hub automatically creates for a base
object. For each record of the base object, the cross-reference table contains one
record per source system. This record contains the primary key from the source system
and the most recent value that the source system has provided for each cell in the base
object table.

Customer Data Integration (CDI)

A discipline within Master Data Management (MDM) that focuses on customer master
data and its related attributes. See master data.

data cleansing

Process of standardizing data content and layout, decomposing/parsing text values into
identifiable elements, verifying identifiable values (such as postal codes) against data
libraries, and replacing incorrect values with correct values from data libraries.



data steward

Siperian Hub user who has the primary responsibility for data quality. Data stewards
access Siperian Hub through the Hub Console, and use Siperian Hub tools to
configure the objects in the Hub Store.

data type

Defines the characteristics of permitted values in a table column—characters, numbers, dates, binary data, and so on. Siperian Hub uses a common set of data types for columns that map directly to data types for the database platform (Oracle or DB2) used in your Siperian Hub implementation.

database

Organized collection of data in the Hub Store. Siperian Hub supports two types of databases: a Master Database and an Operational Record Store (ORS). See Master Database, Operational Record Store (ORS), and Hub Store.

Data Manager

Tool used to review the results of all merges—including automatic merges—and to correct data content if necessary. It provides you with a view of the data lineage for each base object record. The Data Manager also allows you to unmerge previously merged records, and to view different types of history on each consolidated record.

datasource

In the application server environment, a datasource is a JDBC resource that identifies information about a database, such as the location of the database server, the database name, the database user ID and password, and so on. Siperian Hub needs this information to communicate with an ORS.
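As an illustration of how those connection details fit together, the helpers below build ordinary JDBC connection URLs from the same pieces of information (host, port, service or database name) that the DeployCustomFunction tool prompts for. The URL formats are the standard Oracle thin-driver and DB2 JDBC conventions; the host names and database names are placeholders, and in a real deployment the application server's datasource configuration supplies these values.

```java
public class JdbcUrls {
    // Oracle thin-driver URL using a service name.
    public static String oracle(String host, int port, String service) {
        return "jdbc:oracle:thin:@//" + host + ":" + port + "/" + service;
    }

    // DB2 URL using host, port, and database name.
    public static String db2(String host, int port, String database) {
        return "jdbc:db2://" + host + ":" + port + "/" + database;
    }

    public static void main(String[] args) {
        System.out.println(oracle("dbhost.example.com", 1521, "ORCL"));
        System.out.println(db2("dbhost.example.com", 50000, "CMX_ORS"));
    }
}
```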

decay curve

Visually shows the way that trust decays over time. Its shape is determined by the
configured decay type and decay period. See decay period, decay type.

decay period

The amount of time (days, weeks, months, quarters, and years) that it takes for the trust
level to decay from the maximum trust level to the minimum trust level. See decay
curve, decay type.

decay type

The way that the trust level decreases during the decay period. See linear decay, RISL
decay, SIRL decay, decay curve, decay period.

delta detection

During the stage process, Siperian Hub only processes new or changed records when
this feature is enabled. Delta detection can be done either by comparing entire records
or via a date column.
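Both delta-detection styles can be sketched in a few lines. This is a conceptual illustration only, not Siperian Hub's implementation: a record is a delta either because its full contents differ from the previously staged version, or because its last-update date falls after the last stage run.

```java
import java.time.LocalDate;
import java.util.Map;
import java.util.Objects;

public class DeltaDetection {
    // Full-record comparison: the record is a delta if it was never staged
    // before, or if any of its column values changed.
    public static boolean isDeltaByComparison(Map<String, Object> current,
                                              Map<String, Object> previous) {
        return previous == null || !Objects.equals(current, previous);
    }

    // Date-column comparison: the record is a delta if its last-update date
    // is after the last stage run.
    public static boolean isDeltaByDate(LocalDate lastUpdate, LocalDate lastRun) {
        return lastUpdate.isAfter(lastRun);
    }

    public static void main(String[] args) {
        System.out.println(isDeltaByDate(LocalDate.of(2007, 6, 2),
                                         LocalDate.of(2007, 6, 1))); // true
    }
}
```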

dependent object

A table that is used to store detailed information about the records in a base object (for
example, supplemental notes). One record in a base object table can map to multiple
records in a dependent object table.



distinct mapping

A mapping between a column in a landing table and a staging table that selects only the
distinct records from the landing table. Using distinct mapping is useful in situations in
which you have a single landing table feeding multiple staging tables and the landing
table is denormalized (for example, it contains both customer and address data). See
mapping, conditional mapping.

distinct source system

A source system that provides data that gets inserted into the base object without being
consolidated. See source system.

downgrade

Operation that occurs during the load process when a validation rule reduces the trust
for a record by a percentage.

duplicate

One or more records in which the data in certain columns (such as name, address, or
organization data) is identical or nearly identical. Match rules executed during the
match process determine whether two records are sufficiently similar to be considered
duplicates for consolidation purposes.

entity

In Hierarchy Manager, an entity is any object, person, organization, place or thing that
has meaning and can be acted upon in your database. Examples include a specific
person’s name, a specific checking account number, a specific company, a specific
address, and so on. See entity type.

entity base object

An entity base object is a base object used to store information about Hierarchy Manager
entities. See entity type and entity.

entity type

In Hierarchy Manager, an entity type is a logical classification of one or more entities. Examples include doctors, checking accounts, banks, and so on. All entities with the same entity type are stored in the same entity object. In the HM Configuration tool, entity types are displayed in the navigation tree under the Entity Object with which the type is associated. See entity.

exact match

A match / search strategy that matches only records that are identical. If you specify an
exact match, you can define only exact match columns for this base object
(exact-match base objects cannot have fuzzy match columns). A base object that uses
the exact match / search strategy is called an exact-match base object. See also match /
search strategy, fuzzy match.

external application user

Siperian Hub user who accesses Siperian Hub data indirectly via third-party applications.

extract-transform-load (ETL) tool

A software tool (external to Siperian Hub) that extracts data from a source system,
transforms the data (using rules, lookup tables, and other functionality) to convert it to
the desired state, and then loads (writes) the data to a target database. For Siperian Hub
implementations, ETL tools are used to extract data from source systems and populate
the landing tables.

foreign key

In a relational database, a column (or set of columns) whose value corresponds to a primary key value in another table (or, in rare cases, the same table). The foreign key acts as a pointer to the other table. For example, the Department_Number column in the Employee table would be a foreign key that points to the primary key of the Department table.



fuzzy match

A match / search strategy that uses probabilistic matching, which takes into account
spelling variations, possible misspellings, and other differences that can make matching
records non-identical. If selected, Siperian Hub adds a special column (Fuzzy Match
Key) to the base object. This column is the primary field used during searching and
matching to generate match candidates for this base object. All fuzzy base objects have
one and only one Fuzzy Match Key. A base object that uses the fuzzy match / search
strategy is called a fuzzy-match base object. Using fuzzy match requires a selected
population. See also match / search strategy, exact match, and population.

GET

A Siperian Hub operation that gets the specified (by key) record and, optionally,
content metadata from a base object. See PUT.

global business identifier (GBID)

A column that contains common identifiers (key values) that allow you to uniquely and
globally identify a record based on your business needs. Examples include:
• identifiers defined by applications external to Siperian Hub, such as ERP or CRM systems
• identifiers defined by external organizations, such as industry-specific codes (AMA numbers, DEA numbers, and so on), or government-issued identifiers (social security number, tax ID number, driver’s license number, and so on)

global role

A role (list of assigned verbs) that applies to the entire Operational Reference Store
(ORS).

hierarchy

In Hierarchy Manager, a set of relationship types. These relationship types are not
ranked based on the place of the entities of the hierarchy, nor are they necessarily
related to each other. They are merely relationship types that are grouped together for ease of classification and identification. See hierarchy type, relationship, relationship
type.

hierarchy type

In Hierarchy Manager, a logical classification of hierarchies. The hierarchy type is the general class of hierarchy under which a particular relationship falls. See hierarchy.

history table

A type of table in an ORS that contains historical information about changes to an associated table. History tables provide detailed change-tracking options, including merge and unmerge history, history of the pre-cleansed data, history of the base object, and history of the cross-reference.

HM package

A Hierarchy Manager package represents a subset of an MRM package and contains the metadata needed by Hierarchy Manager.

hotspot

In business data, a group of records representing overmatched data—a large intersection of matches.

Hub Store

In a Siperian Hub implementation, the database that contains the Master Database and
one or more Operational Record Stores (ORSs). See Master Database, Operational
Record Store (ORS).

immutable source

A data source that always provides the best, final version of the truth for a base object.
Records from an immutable source will be accepted as unique and, once a record from
that source has been fully consolidated, it will not be changed—even in the event of a merge. Immutable sources are also distinct systems. For all source records from an
immutable source system, the consolidation indicator for Load and PUT is always 1
(consolidated record).

implementer

Siperian Hub user who has the primary responsibility for designing, developing, testing,
and deploying Siperian Hub according to the requirements of an organization. Tasks
include (but are not limited to) creating design objects, building the schema, defining
match rules, performance tuning, and other activities.

incremental load

Any load process that occurs after a base object has undergone its initial data load.
Called incremental loading because only new or updated data is loaded into the base
object. Duplicate data is ignored. See initial data load.

initial data load

The very first time that data is loaded into an empty base object. During the initial
data load, all records in the staging table are inserted into the base object as new
records.

intertable matching

Process of matching on the match columns of a child base object. Match columns can
be used to match on a match column from a child base object, which in turn can be
based on any text column or combination of text columns in the child base object. See
match column, match process.

job execution log

In the Batch Viewer and Batch Group tools, a log that shows job completion status
with any associated messages, such as success, failure, or warning.

job execution script

For Siperian Hub implementations, a script that is used in job scheduling software
(such as Tivoli or CA Unicenter) that executes Siperian Hub batch jobs via stored
procedures.

key match job

A Siperian Hub batch job that matches records from two or more sources when these
sources use the same primary key. Key Match jobs compare new records to each other
and to existing records, and then identify potential matches based on the comparison
of source record keys as defined by the primary key match rules. See primary key match
rule, match process.

key type

Identifies important characteristics about the match key to help Siperian Hub generate
keys correctly and conduct better searches. Siperian Hub provides the following match
key types: Person_Name, Organization_Name, and Address_Part1. See match process.

key width

Determines how fast searches are during the match process, the number of possible match candidates returned, and how much disk space the keys consume. Key width options are Standard, Extended, Limited, and Preferred. Key widths apply to fuzzy match objects only. See match process.

land process

Process of populating landing tables from a source system. See source system, landing
table.

landing table

A table where a source system puts data that will be processed by Siperian Hub.



linear decay

The trust level decreases in a straight line from the maximum trust to the minimum
trust. See decay type, trust.
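As a numeric sketch of straight-line decay (the trust values and the use of days here are assumptions for illustration, not Siperian defaults), the trust level at any point in the decay period is a linear interpolation between the maximum and minimum trust:

```java
public class LinearDecay {
    // Trust decreases in a straight line from maximum trust to minimum trust
    // over the decay period. elapsed and period share a unit (e.g. days).
    public static double trust(double max, double min, double period, double elapsed) {
        if (elapsed <= 0) return max;      // value just changed: maximum trust
        if (elapsed >= period) return min; // decay period elapsed: minimum trust
        return max - (max - min) * (elapsed / period);
    }

    public static void main(String[] args) {
        // Halfway through a 100-day decay period from trust 80 down to 20.
        System.out.println(trust(80, 20, 100, 50)); // 50.0
    }
}
```

Note that when the maximum and minimum trust are equal, this formula yields a flat line, matching the behavior described under minimum trust.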

link process

Process of linking two or more records in a base object table because they have the
same value (or very similar values) in the specified match columns. Contrast with merge
process. See consolidation process, autolink, manual link, manual unlink.

link-style base object

Type of base object that is used with Siperian Hub’s match and link capabilities.
Link-style base objects have an associated LINK table. See link process.

load insert

When records are inserted into the target table (base object or dependent object).
During the load process, if a record in the staging table does not already exist in the
target table, then Siperian Hub inserts the record into the target table. See load process,
load update.

load process

Process of loading data from a staging table into the corresponding base object or
dependent object in the Hub Store. If the new data overlaps with existing data in the
Hub Store, Siperian Hub uses trust settings and validation rules to determine which
value is more reliable. See trust, validation rule, load insert, load update.

load update

When records are updated in the target table (base object or dependent object). During the load process, if a record in the staging table already exists in the target table, then Siperian Hub updates the record in the target table. See load process, load insert.

lookup

Process of retrieving a data value from a parent table during Load jobs. In Siperian
Hub, when configuring a staging table associated with a base object, if a foreign key
column in the staging table (as the child table) is related to the primary key in a parent
table, you can configure a lookup to retrieve data from that parent table.

manual link

Process of linking records manually. For link-style base objects only. Match rules can
result in automatic linking or manual linking. A match rule that instructs Siperian Hub
to perform a manual link identifies records that have enough points of similarity to
warrant attention from a data steward, but not enough points of similarity to allow the
system to automatically link the records. See autolink, link-style base object.

manual merge

Process of merging records manually. For merge-style base objects only. Match rules
can result in automatic merging or manual merging. A match rule that instructs
Siperian Hub to perform a manual merge identifies records that have enough points of
similarity to warrant attention from a data steward, but not enough points of similarity
to allow the system to automatically merge the records. See automerge, merge-style
base object.

manual unmerge

Process of unmerging records manually. For merge-style base objects only. See manual
merge, merge-style base object.

manual unlink

Process of unlinking records manually. For link-style base objects only. See manual link,
link-style base object.



mapping

Defines a set of transformations that are applied to source data. Mappings are used
during the stage process (or via a Siperian Hub operation) to transfer data from a
landing table to a staging table. A mapping identifies the source column in the landing
table and the target column to populate in the staging table, along with any
intermediate cleanse functions used to clean the data. See conditional mapping, distinct
mapping.

master data

A collection of common, core entities—along with their attributes and their values—that are considered critical to a company's business, and that are required for use in two or more systems or business processes. Examples of master data include customer, product, employee, supplier, and location data. See Master Data Management (MDM), Customer Data Integration (CDI).

Master Data Management (MDM)

The controlled process by which the master data is created and maintained as the
system of record for the enterprise. MDM is implemented in order to ensure that the
master data is validated as correct, consistent, and complete,
and—optionally—circulated in context for consumption by internal or external
business processes, applications, or users. See master data, Customer Data Integration
(CDI).

Master Database

Database that contains all the Siperian Hub metadata, including configuration settings
and other information that Siperian Hub requires to run properly. The default name of
the Master Database is CMX_SYSTEM. See also Operational Record Store (ORS).

match

The process of determining whether two records should be automatically merged or should be candidates for manual merge because the two records have identical or similar values in the specified columns. See match process.

match codes

Strings of characters representing the contents of the data to be compared. During the
match process, the more complex match types result in the generation of sophisticated
match codes based on the degree of similarity required. See also tokenizing, match
process.

match column

A column that is used in a match rule for comparison purposes. Each match column is
based on one or more columns from the base object. See match process.

match column rule

Match rule that is used to match records based on the values in columns you have
defined as match columns, such as last name, first name, address1, and address2. See
primary key match rule, match process.

match list

Defines custom-built standardization lists. Pre-defined functions, by contrast, provide access to specialized cleansing functionality such as address verification or address decomposition. See match process.

match path

Allows you to traverse the hierarchy between records—whether that hierarchy exists
between base objects (inter-table paths) or within a single base object (intra-table paths).
Match paths are used for configuring match column rules involving related records in
either separate tables or in the same table.



match process

Process of comparing two records for points of similarity. If sufficient points of similarity are found to indicate that two records probably are duplicates of each other, Siperian Hub flags those records for merging.

match purpose

For fuzzy-match base objects, defines the primary goal behind a match rule. For
example, if you're trying to identify matches for people where address is an important
part of determining whether two records are for the same person, then you would use
the Match Purpose called Resident. Each match purpose contains knowledge about
how best to compare two records to achieve the purpose of the match. Siperian Hub
uses the selected match purpose as a basis for applying the match rules to determine
matched records. The behavior of the rules is dependent on the selected purpose. See
match process.

match rule

Defines the criteria by which Siperian Hub determines whether records might be
duplicates. Match columns are combined into match rules to determine the conditions
under which two records are regarded as being similar enough to merge. Each match
rule tells Siperian Hub the combination of match columns it needs to examine for
points of similarity. See match process.

match rule set

A logical collection of match rules that allow users to execute different sets of rules at
different stages in the match process. Match rule sets include a search level that dictates
the search strategy, any number of automatic and manual match rules, and optionally, a
filter that allows you to selectively include or exclude records during the match process. Match rule sets are used to execute match column rules, but not primary key match rules. See match process.

match subtype

Used with base objects that contain different types of data, such as an Organization
base object containing customer, vendor, and partner records. Using match subtyping,
you can apply match rules to specific types of data within the same base object. For
each match rule, you specify an exact match column that will serve as the “subtyping”
column to filter out the records that you want to ignore for that match rule. See match
process.

match table

Type of system table, associated with a base object, that supports the match process.
During the execution of a Match job for a base object, Siperian Hub populates its
associated match table with the ROWID_OBJECT values for each pair of matched
records, as well as the identifier for the match rule that resulted in the match, and an
automerge indicator. See match process.

match token

Strings that encode the columns used to identify candidates for matching. See match
process.

match type

Each match column has a match type that determines how the match column will be
tokenized in preparation for the match comparison. See match process.

match / search strategy

Specifies the reliability of the match versus the performance you require: fuzzy or
exact. An exact match / search strategy is faster, but an exact match will miss some
matches if the data is imperfect. See fuzzy match, exact match, match process.



maximum trust

The trust level that a data value will have if it has just been changed. For example, if
source system A changes a phone number field from 555-1234 to 555-4321, the new
value will be given system A’s maximum trust level for the phone number field. By
setting the maximum trust level relatively high, you can ensure that changes in the
source systems will usually be applied to the base object.

merge process

Process of combining two or more records of a base object table because they have the
same value (or very similar values) in the specified match columns. Contrast with link
process. See consolidation process, automerge, manual merge, manual unmerge.

merge-style base object

Type of base object that is used with Siperian Hub’s match and merge capabilities. See
merge process.

Merge Manager

Tool used to review and take action on the records that are queued for manual
merging.

message

In Siperian Hub, refers to a Java Message Service (JMS) message. A message queue
server handles two types of JMS messages:
• inbound messages are used for the asynchronous processing of Siperian Hub
service invocations
• outbound messages provide a communication channel to distribute data changes
via JMS to source systems or other systems.

message queue

A mechanism for transmitting data from one process to another (for example, from
Siperian Hub to an external application).

message queue rule

A mechanism for identifying base object events and transferring the affected records to
the internal system for update. Message queue rules are supported for updates, merges,
and records accepted as unique.

message queue server

In Siperian Hub, a Java Message Service (JMS) server, defined in your application
server environment, that Siperian Hub uses to manage incoming and outgoing JMS
messages.

message trigger

A rule that gets fired when a particular action occurs within Siperian Hub.
When an action occurs for which a rule is defined, a JMS message is placed in the
outbound message queue. A message trigger also specifies the queue in which
messages are placed.

metadata

Data that is used to describe other data. In Siperian Hub, metadata is used to describe
the schema (data model) that is used in your Siperian Hub implementation. Metadata
describes the various schema definition components—tables, columns, indexes, key
relationships, and so on—in the Hub Store. See also schema, metadata validation.

metadata validation

Process of verifying the completeness and integrity of the metadata that describes a
repository (ORS). The Metadata Manager tool runs this process. See also metadata,
Operational Record Store (ORS).

184 Siperian Hub XU Implementer’s Guide


minimum trust

The trust level that a data value will have when it is “old” (after the decay period has
elapsed). This value must be less than or equal to the maximum trust. If the maximum
and minimum trust are equal, the decay curve is a flat line and the decay period and
decay type have no effect. See also decay period.
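
Assuming a simple linear decay (the product's curves can also be parabolic — see RISL decay and SIRL decay), the relationship between maximum and minimum trust can be sketched as follows; the class and method names are illustrative, not part of the product:

```java
// Illustrative sketch of maximum/minimum trust with a linear decay curve.
public class TrustBounds {
    // Trust of a value whose age is `elapsed` units into a decay period
    // of `period` units. When maxTrust == minTrust, the "curve" is a
    // flat line, as the glossary entry notes.
    public static double decayedTrust(double maxTrust, double minTrust,
                                      double period, double elapsed) {
        if (elapsed >= period) {
            return minTrust;                 // fully decayed
        }
        double fraction = elapsed / period;  // portion of the period used
        return maxTrust - (maxTrust - minTrust) * fraction;
    }

    public static void main(String[] args) {
        System.out.println(decayedTrust(90, 50, 10, 5)); // halfway: 70.0
        System.out.println(decayedTrust(80, 80, 10, 7)); // flat line: 80.0
    }
}
```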

non-equal matching

When configuring match rules, prevents equal values in a column from matching each
other. Non-equal matching applies only to exact match columns.

null value

The absence of a value in a column of a record. Null is not the same as blank or zero.

operation

Siperian Hub operation (API) that allows external applications to access specific
Siperian Hub functionality via the Services Integration Framework (SIF) using a
request/response model. See also GET, PUT.

Operational Record Store (ORS)

Database that contains all of the data you load or create within the Siperian Hub
system, including all master record data. A Siperian Hub configuration can have one or
more ORS databases. The default name of an ORS is CMX_ORS. See also Master
Database.

overmatching

For fuzzy-match base objects only, a match that results in too many matches, including
matches that are not relevant. When configuring match, the goal is to find the optimal
number of matches for your data. See undermatching.

package

A package is a public view of one or more underlying tables in Siperian Hub. Packages
represent subsets of the columns in those tables, along with any other tables that are
joined to the tables. A package is based on a query. The underlying query can select a
subset of records from the table or from another package.

password policy

Specifies password characteristics for Siperian Hub user accounts, such as the password
length, expiration, login settings, password re-use, and other requirements. You can
define a global password policy for all user accounts in a Siperian Hub implementation,
and you can override these settings for individual users.

path

See match path.

policy decision points (PDPs)

In Siperian Hub implementations, specific security check points that determine, at run
time, the validity of a user’s identity (authentication), along with that user’s access to
Siperian Hub resources (authorization).

policy enforcement points (PEPs)

In Siperian Hub implementations, specific security check points that enforce, at run
time, security policies for authentication and authorization requests.

population

Defines certain characteristics about data in the records that you are matching. By
default, Siperian Hub comes with the US population, but Siperian provides a standard
population per country. Populations account for the inevitable variations and errors
that are likely to exist in name, address, and other identification data; specify how
Siperian Hub builds match tokens; and specify how search strategies and match
purposes operate on the population of data to be matched. Used only with the Fuzzy
match/search strategy.

primary key

In a relational database table, a column (or set of columns) whose value uniquely
identifies a record. For example, the Department_Number column would be the
primary key of the Department table.

primary key match rule

Match rule that is used to match records from two systems that use the same primary
keys for records. See also match column rule.

private resource

A protected Siperian Hub resource that is hidden from the Roles tool, preventing its
access via Services Integration Framework (SIF) operations. When you add a new
resource in Hub Console (such as a new base object), it is designated a PRIVATE
resource by default. See also secure resource, resource.

privilege

Permission to access a Siperian Hub resource. With Siperian Hub internal
authorization, each role is assigned one of the following privileges.

Privilege   Allows the User To...
READ        View but not change data.
CREATE      Create data records in the Hub Store.
UPDATE      Update data records in the Hub Store.
MERGE       Merge and unmerge data.
EXECUTE     Execute cleanse functions and batch groups.

Privileges determine the access that external application users have to Siperian Hub
resources. For example, a role might be configured to have READ, CREATE,
UPDATE, and MERGE privileges on particular packages and package columns.
These privileges are not enforced when using the Hub Console, although the settings
still affect the use of Hub Console to some degree. See secure resource, role.

profile

In Hierarchy Manager, describes what fields and records an HM user may display, edit,
or add. For example, one profile can allow full read/write access to all entities and
relationships, while another profile can be read-only (no add or edit operations
allowed).

provider

See security provider.

provider property

A name-value pair that a security provider might require in order to access the
service(s) that it provides.

PUT

A Siperian Hub operation that inserts or updates a record in the base object. See GET.

query

A request to retrieve data from the Hub Store. Siperian Hub allows administrators to
specify the criteria used to retrieve that data. Queries can be configured to return
selected columns, filter the result set with a WHERE clause, use complex query syntax
(such as GROUP BY, SORT BY, and HAVING clauses), and use aggregate functions
(such as SUM, COUNT, and AVG).

query group

A logical group of queries. A query group is simply a mechanism for organizing
queries. See query.

raw table

A table that archives data from a landing table.

real-time mode

Way of interacting with Siperian Hub via third-party applications, which invoke
Siperian Hub operations via the Services Integration Framework (SIF) interface. SIF
provides operations for various services, such as reading, cleansing, matching, inserting,
and updating records. See also batch mode, Services Integration Framework (SIF).

record

A row in a table that represents an instance of an object. For example, in an Address
table, a record contains a single address.

referential integrity

Enforcement of parent-child relationship rules among tables based on configured
foreign key relationships.

regular expression

A computational expression that is used to match and manipulate text data according
to commonly-used syntactic conventions and symbolic patterns. In Siperian Hub, a
regular expression function allows you to use regular expressions for cleanse
operations. To learn more about regular expressions, including syntax and patterns,
refer to the Javadoc for java.util.regex.Pattern.
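
For example, a hypothetical cleanse step (not a built-in Siperian Hub function) might use java.util.regex to strip punctuation and collapse whitespace in a name field:

```java
import java.util.regex.Pattern;

// Hypothetical regex-based cleanse rule, shown only to illustrate how
// java.util.regex.Pattern can be applied to cleanse operations.
public class RegexCleanse {
    // Anything that is not a letter, digit, or space is "noise".
    private static final Pattern NOISE = Pattern.compile("[^A-Za-z0-9 ]");
    // One or more whitespace characters.
    private static final Pattern SPACES = Pattern.compile("\\s+");

    public static String cleanse(String input) {
        String noNoise = NOISE.matcher(input).replaceAll("");
        return SPACES.matcher(noNoise).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        System.out.println(cleanse("  Acme,   Inc.  ")); // "Acme Inc"
    }
}
```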

reject table

A table that contains records that Siperian Hub could not insert into a target table,
such as:
• staging table (stage process) after performing the specified cleansing on a record of
the specified landing table
• Hub store table (load process)

A record could be rejected because the value of a cell is too long, or because the
record’s update date is later than the current date.

relationship

In Hierarchy Manager, describes the affiliation between two specific entities. Hierarchy
Manager relationships are defined by specifying the relationship type, hierarchy type,
attributes of the relationship, and dates for when the relationship is active. See
relationship type, hierarchy.

relationship base object

A relationship base object is a base object used to store information about Hierarchy
Manager relationships.

relationship type

Describes general classes of relationships. The relationship type defines:
• the types of entities that a relationship of this type can include
• the direction of the relationship (if any)
• how the relationship is displayed in the Hub Console

See relationship, hierarchy.

repository

See Operational Record Store (ORS).

resource

Any Siperian Hub component that is used in your Siperian Hub implementation.
Certain resources can be configured as secure resources: base objects, dependent
objects, mappings, packages, remote packages, cleanse functions, HM profiles, the audit
table, and the users table. In addition, you can configure secure resources that are
accessible by SIF operations, including content metadata, match rule sets, metadata,
batch groups, the audit table, and the users table. See private resource, secure resource,
resource group.

resource group

A logical collection of secure resources that simplifies privilege assignment, allowing
you to assign privileges to multiple resources at once (for example, by assigning
resource groups to a role). See resource, privilege.

RISL decay

Rapid Initial Slow Later decay puts most of the decrease at the beginning of the decay
period. The trust level follows a concave parabolic curve. If a source system has this
decay type, a new value from the system will probably be trusted but this value will
soon become much more likely to be overridden.

role

Defines a set of privileges to access secure Siperian Hub resources. See user, user
group, privilege.

row

See record.

rule

See match rule.

rule set

See match rule set.

rule set filtering

Ability to exclude records from being processed by a match rule set. For example, if
you had an Organization base object that contained multiple types of organizations
(customers, vendors, prospects, partners, and so on), you could define a match rule set
that selectively processed only vendors. See match process.

sandbox

In Hierarchy Manager, a virtual playground where users with the appropriate privileges
(as defined in their HM Profile) can manipulate relationship criteria without impacting
the original master record. In this way, users can predict the results of their
manipulations and determine whether they will be meaningful in the context for which
they were created.

schema

The data model that is used in a customer’s Siperian Hub implementation. Siperian
Hub does not impose or require any particular schema. The schema is independent of
the source systems.

search levels

Defines how stringently Siperian Hub searches for matches: narrow, typical, exhaustive,
or extreme. The goal is to find the optimal number of matches for your data—not too
few (undermatching), which misses significant matches, or too many (overmatching),
which generates too many matches, including insignificant ones. See overmatching,
undermatching.

secure resource

A protected Siperian Hub resource that is exposed to the Roles tool, allowing the
resource to be added to roles with specific privileges. When a user account is assigned
to a specific role, then that user account is authorized to access the secure resources via
SIF according to the privileges associated with that role. In order for external
applications to access a Siperian Hub resource via SIF operations, that resource must
be configured as SECURE. Because all Siperian Hub resources are PRIVATE by
default, you must explicitly make a resource SECURE after the resource has been
added. See also private resource, resource.

Status Setting   Description
SECURE           Exposes this Siperian Hub resource to the Roles tool, allowing the
                 resource to be added to roles with specific privileges.
PRIVATE          Default. Hides this Siperian Hub resource from the Roles tool and
                 prevents its access via Services Integration Framework (SIF)
                 operations. When you add a new resource in Hub Console (such as
                 a new base object), it is designated a PRIVATE resource by default.

security

The ability to protect information privacy, confidentiality, and data integrity by
guarding against unauthorized access to, or tampering with, data and other resources in
your Siperian Hub implementation. See also authentication, authorization, privilege,
resource.

security provider

A third-party organization that provides security services (authentication,
authorization, and user profile services) for users accessing Siperian Hub.

security payload

Raw binary data returned by a Siperian Hub operation request that can contain
supplemental data required for further authentication and/or authorization.

segment matching

Way of limiting match rules to specific subsets of data. For example, you could define
different match rules for customers in different countries by using segment matching
to limit certain rules to specific country codes. Segment matching is configured on a
per-rule basis and applies to both exact-match and fuzzy-match base objects.

Services Integration Framework (SIF)

The part of Siperian Hub that interfaces with client programs. Logically, it serves as a
middle tier in the client/server model. It enables you to implement the
request/response interactions using any of the following architectural variations:
• Loosely coupled Web services using the SOAP protocol.
• Tightly coupled Java remote procedure calls based on Enterprise JavaBeans (EJBs)
or XML.
• Asynchronous Java Message Service (JMS)-based messages.
• XML documents going back and forth via Hypertext Transfer Protocol (HTTP).

Each of the above SIF protocols sits on top of the native Siperian Hub protocol, which
accepts requests in the form of XML documents or EJBs and returns responses the
same way.

SIRL decay

Slow Initial Rapid Later decay puts most of the decrease at the end of the decay period.
The trust level follows a convex parabolic curve. If a source system has this decay type,
it will be relatively unlikely for any other system to override the value that it sets until
the value is near the end of its decay period.
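
The two decay shapes can be contrasted with a simple sketch; the square-root and square functions below merely stand in for the concave and convex parabolic curves — they are not Siperian Hub's actual formulas:

```java
// Illustrative decay shapes; not Siperian Hub's actual trust formulas.
public class DecayCurves {
    // x is the fraction of the decay period that has elapsed, in [0, 1].

    // RISL: Rapid Initial, Slow Later -> concave curve (sqrt stands in).
    public static double rislFraction(double x) { return Math.sqrt(x); }

    // SIRL: Slow Initial, Rapid Later -> convex curve (square stands in).
    public static double sirlFraction(double x) { return x * x; }

    // Trust remaining once `decayedFrac` of the decay has been applied.
    public static double trust(double max, double min, double decayedFrac) {
        return max - (max - min) * decayedFrac;
    }

    public static void main(String[] args) {
        // A quarter of the way through the period, a RISL value has lost
        // far more trust than a SIRL value.
        System.out.println(trust(100, 0, rislFraction(0.25))); // 50.0
        System.out.println(trust(100, 0, sirlFraction(0.25))); // 93.75
    }
}
```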

source system

A system that provides data to Siperian Hub. See distinct source system.

stage process

Process of reading the data from the landing table, performing any configured
cleansing, and moving the cleansed data into the corresponding staging table. If you
enable delta detection, Siperian Hub only processes new or changed records. See
staging table, landing table.

staging table

A table where cleansed data is temporarily stored before being loaded into base objects
and dependent objects via load jobs. See stage process, load process.

stored procedure

A named set of Structured Query Language (SQL) statements that are compiled and
stored on the database server. Siperian Hub batch jobs are encoded in stored
procedures so that they can be run using job execution scripts in job scheduling
software (such as Tivoli or CA Unicenter).

stripping

Deprecated term. See tokenizing.

strip table

Deprecated term. See token table.

system column

A column in a table that contains Siperian Hub metadata. For each type of table in an
ORS, Siperian Hub automatically creates system columns. Typical system columns for a
base object include ROWID_OBJECT, CONSOLIDATION_IND, and LAST_
UPDATE_DATE. See column.

table

In a database, a collection of data that is organized in rows (records) and columns.
A table can be seen as a two-dimensional set of values corresponding to an object.
The columns of a table represent characteristics of the object, and the rows represent
instances of the object. In the Hub Store, the Master Database and each Operational
Record Store (ORS) represents a collection of tables. Base objects and dependent
objects are stored as tables in an ORS.

target database

In the Hub Console, the Master Database or an Operational Record Store (ORS) that
is the target of the current tool. Tools that manage data stored in the Master Database,
such as the Users tool, require that your target database is the Master Database. Tools
that manage data stored in an ORS require that you specify which ORS to use.

token table

When you specify a match column, Siperian Hub creates a special key called a match
key (also known as a token string) on a special table called the token table (formerly
referred to as the strip table). Before the Siperian Hub Match batch job runs, it first
ensures that the correct match keys have been generated in the token table. The match
job compares the match keys according to the match rules that have been defined to
determine which records are duplicates. See also tokenizing.

tokenizing

Specialized form of data standardization that is performed before the match
comparisons are done. For the most basic match types, tokenizing simply removes
“noise” characters like spaces and punctuation. The more complex match types result
in the generation of sophisticated match codes—strings of characters representing the
contents of the data to be compared—based on the degree of similarity required. See
also token table, match codes.
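
To illustrate only the basic idea (removing noise characters), a naive match key might look like the following; Siperian Hub's real match keys are produced by population-specific algorithms and are far more sophisticated:

```java
import java.util.Locale;

// Naive illustration of tokenizing; not Siperian Hub's key generation.
public class NaiveMatchKey {
    // Uppercase the value and keep only letters and digits, so that
    // case, spacing, and punctuation differences do not affect matching.
    public static String matchKey(String value) {
        StringBuilder key = new StringBuilder();
        for (char c : value.toUpperCase(Locale.ROOT).toCharArray()) {
            if (Character.isLetterOrDigit(c)) {
                key.append(c);   // drop spaces and punctuation ("noise")
            }
        }
        return key.toString();
    }

    public static void main(String[] args) {
        System.out.println(matchKey("Smith, John")); // SMITHJOHN
    }
}
```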

traceability

The maintenance of data so that you can determine which systems—and which records
from those systems—contributed to consolidated records.

transactional data

Represents the actions performed by an application, typically captured or generated by
an application as part of its normal operation. It is usually maintained by only one
system of record, and tends to be accurate and reliable in that context. For example,
your bank probably has only one application for managing transactional data resulting
from withdrawals, deposits, and transfers made on your checking account.

trust

Mechanism for measuring the confidence factor associated with each cell based on its
source system, change history, and other business rules. Trust takes into account the
age of data, how much its reliability has decayed over time, and the validity of the data.

trust level

For a source system that provides records to Siperian Hub, a number between 0 and
100 that assigns a level of confidence and reliability to that source system, relative to
other source systems. The trust level has meaning only when compared with the trust
level of another source system.

trust score

The current level of confidence in a given record. During load jobs, Siperian Hub
calculates the trust score for each record. If validation rules are defined for the base
object, then the Load job applies these validation rules to the data, which might further
downgrade trust scores. During the consolidation process, when two records are
candidates for merge or link, the value with the higher trust score wins.
Data stewards can manually override trust scores in the Merge Manager tool.
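
The cell-level survivorship rule can be sketched as follows; the names are illustrative, not the Siperian Hub API:

```java
// Illustrative survivorship sketch; not the Siperian Hub API.
public class Survivorship {
    // During consolidation, the value with the higher trust score wins.
    public static String winningValue(String valueA, double trustA,
                                      String valueB, double trustB) {
        return trustA >= trustB ? valueA : valueB;
    }

    public static void main(String[] args) {
        System.out.println(
            winningValue("123 Elm St", 85.0, "123 Elm Street", 60.0));
    }
}
```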

undermatching

For fuzzy-match base objects only, a match that results in too few matches, which
misses relevant matches. When configuring match, the goal is to find the optimal
number of matches for your data. See overmatching.

unlink

Process of unlinking previously-linked records. For link-style base objects only. See
manual unlink, link-style base object.

unmerge

Process of unmerging previously-merged records. For merge-style base objects only.
See manual unmerge, merge-style base object, cascade unmerge.

user

An individual (person or application) who can access Siperian Hub resources. Users are
represented in Siperian Hub by user accounts, which are defined in the Master Database.
See user group, Master Database.

user group

A logical collection of user accounts. See user.

validation rule

Rule that tells Siperian Hub the condition under which a data value is not valid. When
data meets the criteria specified by the validation rule, the trust value for that data is
downgraded by the percentage specified in the validation rule. If the Reserve Minimum
Trust flag is set for the column, then the trust cannot be downgraded below the
column’s minimum trust.
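
The downgrade-with-floor behavior can be sketched as follows (illustrative names, assuming the downgrade is expressed as a percentage of the current trust):

```java
// Illustrative sketch of a validation-rule trust downgrade with the
// Reserve Minimum Trust floor; names are not the Siperian Hub API.
public class ValidationDowngrade {
    // Downgrade trust by the rule's percentage; if the column reserves
    // minimum trust, never fall below the column's minimum trust.
    public static double downgrade(double trust, double downgradePercent,
                                   boolean reserveMinimumTrust,
                                   double minimumTrust) {
        double downgraded = trust - trust * (downgradePercent / 100.0);
        if (reserveMinimumTrust && downgraded < minimumTrust) {
            return minimumTrust;
        }
        return downgraded;
    }

    public static void main(String[] args) {
        // A 50% downgrade of 80 would give 40, but the floor holds at 60.
        System.out.println(downgrade(80, 50, true, 60));
    }
}
```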

workbench

In the Hub Console, a mechanism for grouping similar tools. A workbench is a logical
collection of related tools. For example, the Cleanse workbench contains
cleanse-related tools: Cleanse Match Server, Cleanse Functions, and Mappings.

write lock

In the Hub Console, a lock that is required in order to make changes to the underlying
schema. All non-data steward tools (except the ORS security tools) are in read-only
mode unless you acquire the write lock. Write locks prevent multiple users from
making changes to the same data at the same time.

Index

A
address household objects 45
addresses design patterns 43
analyze phase 7
analyzing data
  business processes and data 12
  data cleansing 12
  data set, assembling 11
  data sizing 11
  data source characteristics, determining 10
  introduction 10
  match rules 14
  standardization rules 12
  trust levels 13
  validation rules 14
API calls and validation rules 66
asynchronous batch jobs 108
audience x
Auto Match and Merge jobs 113
Autolink jobs 112
Automerge jobs 115

B
base objects, defined 23
batch groups
  about batch groups 137
  cmxbg.execute_batchgroup stored procedure 138
  cmxbg.get_batchgroup_status stored procedure 142
  cmxbg.reset_batchgroup stored procedure 140
  stored procedures for 138
batch jobs
  asynchronous execution 108
  Auto Match and Merge jobs 113
  Autolink jobs 112
  Automerge jobs 115
  C_REPOS_JOB_CONTROL table 110
  C_REPOS_JOB_METRIC table 110
  C_REPOS_JOB_METRIC_TYPE table 110
  C_REPOS_JOB_STATUS_TYPE table 110
  C_REPOS_TABLE_OBJECT_V table 107
  execution scripts 104
  Generate Match Token jobs 118
  Key Match jobs 120
  Load jobs 121
  Match Analyze jobs 128
  Match for Duplicate Data jobs 130
  Match jobs 127
  results monitoring 108
  scheduling 103
  Stage jobs 131
  Unmerge jobs 133
batch processing with validation rules 66
build phase 8
business party model 36
business processes and data 12

C
C_REPOS_JOB_CONTROL table 110
C_REPOS_JOB_METRIC table 110
C_REPOS_JOB_METRIC_TYPE table 110
C_REPOS_JOB_STATUS_TYPE table 110
C_REPOS_TABLE_OBJECT_V table 105, 107
cardinality, defined 76
cascade unmerge 134
cell updates 62
cmxbg.execute_batchgroup 138
cmxbg.get_batchgroup_status 142
cmxbg.reset_batchgroup 140
communication channel models 46
conceptual models 19
consensus deletes 58
consolidation and data modeling 30
cross-column validation 68
custom buttons
  about custom buttons 151
  adding 160
  appearance of 152
  clicking 154
  custom functions, writing 155
  deploying 159
  examples of 156
  icons 159
  listing 160
  properties file 160
  text labels 159
  type change 160
  updating 160
custom functions
  client-based 155
  deleting 160
  server-based 156
  writing 155
custom stored procedures
  about custom stored procedures 145
  example code 146
  parameters of 145
  registering 149
customer data models
  business party model 36
  differentiated customer models 36

D
data cleansing 12
data modeling
  conceptual models 19
  design deliverables 18
  design patterns 42
  design principles 27
  logical models 20
  overview 18
  physical models 24
data set, assembling 11
data sizing 11
data sources
  characteristics, analyzing 10
  third-party 14
data tokenization 78
deletes
  consensus 58
  direct 57
dependent objects, defined 23
dependent tables and match rules 91
deploy phase 8
design patterns
  addresses 43
  communication channel models 46
  households 42
  overview 42
design phase 7
design principles
  consolidation 30
  customers, mixing different types of 36
  deep versus wide 28
  independence test 33
  landing and staging data 40
  match requirements 29
  overview 27
differentiated customer model 36
direct deletes 57
dirty bits, defined 78
discover phase 6

E
exact match column properties
  null match 87
  overview 87
  segment matches 88
execution scripts 104

G
Generate Match Tokens jobs 118

H
Hierarchy Manager
  described 96
  implementation process 98
  preparing to implement 97
households design patterns 42

I
implementation methodology 2
implementation projects
  methodology 2
  phases 6
  roles 4
intended audience x

K
Key Match jobs 120

L
landing tables 40
linear unmerge 134
Load jobs 121
logical models
  design flaws, example of 21
  design flaws fixed, example of 22
  objects in the logical model 23
  overview 20
  pre-existing logical models 23

M
Match Analyze jobs 128
match batch sizes 91
Match for Duplicate Data jobs 130
Match jobs 127
match key types 80
match key widths 79
match levels 84
match purposes
  defined 81
  field types 83
  using 82
match rules
  best practices 86
  data analysis 14
  data modeling design principles 29
  defining 85
  dependent tables 91
  exact match column properties 87
  how matching works 76
  match batch sizes 91
  match key types 80
  match key widths 79
  match levels 84
  match purposes 81
  matching, defined 76
  populations 77
  prerequisites for defining 76
  search strategies 80
  testing 85
  tokenizing data 78
  tokens for match keys 77
matching, defined 76
matchy data, defined 76
merging
  automerge 94
  manual merge 94
methodology
  core principles 2
  introduction 2

N
null matches in exact match column properties 87

P
phases in an implementation project
  analyze phase 7
  build phase 8
  deploy phase 8
  design phase 7
  discover phase 6
  overview 6
physical models
  example of 25
  flexibility for future use 26
  overview 24
  performance 26
  required functionality 26
  scalability 26
  Siperian product roadmap 27
populations 77
projects
  phases 6
  roles 4

R
roles in an implementation project 4

S
search strategies and match rules 80
segment matches in exact match column properties 88
Siperian, about xii
sizing data 11
source systems, ranking trust levels for 55
Stage jobs 131
staging tables 41
standardization rules 12
static tokens, defined 77
stored procedures
  batch groups 138
  batch jobs 111
  custom stored procedures 145

T
tables
  C_REPOS_JOB_CONTROL table 110
  C_REPOS_JOB_METRIC table 110
  C_REPOS_JOB_METRIC_TYPE table 110
  C_REPOS_JOB_STATUS_TYPE table 110
  C_REPOS_TABLE_OBJECT_V table 105
tokenization
  incomplete bits, defined 78
tokenizing data 78
  at load 78
  at match 78
  at put 78
tokens for match keys 77
training xiii
tree unmerge 135
trust levels
  best practices 58
  cell updates 62
  configuration guidelines 60
  configuring 60
  data analysis 13
  decay periods 54
  defined 52
  defining 61
  how trust works 52
  ranking source systems 55
  stored procedure example 63
  using with validation rules 70

U
Unmerge jobs 133
  cascade unmerge 134
  linear unmerge 134
  tree unmerge 135
  unmerge all 134

V
validation rules
  API calls 66
  batch processing 66
  best practices 68
  complex validation rules 68
  cross-column validation 68
  data analysis 14
  defined 65
  grouping of 66
  how validation works 65
  load effects 68
  ordering of 66
  overview 65
  performance effects 68
  SQL statements in 69
  using with trust levels 70
