
Introduction to WebSphere QualityStage

IBM WebSphere QualityStage includes a set of stages, a Match Designer, and related files that provide a
development environment within the WebSphere DataStage and QualityStage Designer for building jobs to
cleanse data. This environment lets you test your matching and blocking strategies before running match jobs,
and lets you manage and edit rules.
The WebSphere QualityStage functionality is available as either a stand-alone subset of WebSphere
DataStage or as an add-on to WebSphere DataStage. This functionality offers the full power and flexibility of
the WebSphere DataStage parallel execution framework and connectivity stages.
The WebSphere QualityStage components include the Match Designer, for designing and testing match
specifications and associated match passes, and the following WebSphere QualityStage stage types:

Investigate
Standardize
Match Frequency
Reference Match
Unduplicate Match
Survive


WebSphere QualityStage work flow


For data cleansing, WebSphere QualityStage provides a work flow where data is processed within each stage.
The results are evaluated, reprocessed, or used as input for the next stage.
On the Designer client parallel canvas, you build a QualityStage job. Each job uses a Data Quality stage (or
stages) to process the data according to the requirements of the stage.
Thorough knowledge of the workflow can help streamline your data cleansing projects. You cleanse data using
a four-phase approach:

Phase One. Translate high-level business directives into specific data cleansing assignments, and establish assumptions about the requirements and structure of the target data.
Phase Two. Identify errors and validate the contents of columns in a data file. Then use the results to refine your business practices.
Phase Three. Condition the source data, match the data for duplicates or cross-references to other files, and determine the surviving record.
Phase Four. Use the results to evaluate how your organization manages its data and to ensure that corporate data supports the company's goals.

Understanding the mission behind your company's business goals helps you define the requirements and structure of the target data, and determine the level of quality that your data needs to meet. This insight provides the context for making appropriate decisions about the data throughout the workflow.

Analyzing source data quality


You can use the Investigate stage to help you understand the quality of the source data and clarify the
direction of succeeding phases of the workflow. In addition, it indicates the degree of processing you
will need to create the target re-engineered data.
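For instance, character investigation classifies each character of a column's values to reveal pattern anomalies. The following Python sketch is a simplified illustration of that idea, not the stage itself; the function and sample data are hypothetical:

    from collections import Counter

    def char_pattern(value):
        # Classify each character: c = letter, n = digit, b = blank,
        # anything else is kept as-is.
        classes = []
        for ch in value:
            if ch.isalpha():
                classes.append("c")
            elif ch.isdigit():
                classes.append("n")
            elif ch == " ":
                classes.append("b")
            else:
                classes.append(ch)
        return "".join(classes)

    # Hypothetical postal code column: two clean values, three anomalies.
    postal_codes = ["10001", "94103", "1000I", "10 01", "A1B2C3"]
    print(Counter(char_pattern(v) for v in postal_codes))
    # Counter({'nnnnn': 2, 'nnnnc': 1, 'nnbnn': 1, 'cncncn': 1})

A pattern report like this shows at a glance how much of the column conforms to the expected format and how much conditioning the data will need.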

Data reformatting and conditioning


Standardizing data involves moving free-form data (columns that contain more than one data entry)
into fixed columns and manipulating data to conform to standard conventions. The process identifies
and corrects invalid values, standardizes spelling formats and abbreviations, and validates the format
and content of the data.
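The following Python sketch illustrates the general idea of standardization; the abbreviation table, function, and sample data are hypothetical, and the actual stage applies rule sets rather than ad hoc code:

    # Hypothetical abbreviation table; the real stage uses rule sets.
    ABBREVIATIONS = {"st": "STREET", "st.": "STREET", "ave": "AVENUE", "rd": "ROAD"}

    def standardize_address(free_form):
        # Split a free-form address into fixed columns with standard spellings.
        tokens = free_form.split()
        house_number = tokens[0] if tokens and tokens[0].isdigit() else ""
        rest = tokens[1:] if house_number else tokens
        words = [ABBREVIATIONS.get(w.lower(), w.upper()) for w in rest]
        return {"house_number": house_number, "street_name": " ".join(words)}

    print(standardize_address("123 Main St."))
    # {'house_number': '123', 'street_name': 'MAIN STREET'}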

Generating match frequency data


The Match Frequency stage generates frequency data that tells you how often a particular value
appears in a particular column.
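Frequency data matters because agreement on a rare value (a surname such as OKONKWO) is stronger evidence of a match than agreement on a common one (SMITH). A minimal Python sketch of the counting itself, with hypothetical sample data:

    from collections import Counter

    # Hypothetical surname column: common values carry less matching weight.
    surnames = ["SMITH", "SMITH", "SMITH", "GARCIA", "OKONKWO"]
    for value, count in Counter(surnames).most_common():
        print(value, count)
    # SMITH 3
    # GARCIA 1
    # OKONKWO 1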

Ensuring data integrity


Matching identifies duplicate records in one file and builds relationships between records in multiple
files. Relationships are defined by business rules at the data level.
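A common approach is to group records into blocks on one or more columns and compare records only within each block, which keeps the number of comparisons manageable. The following Python sketch illustrates the idea with a hypothetical blocking column and business rule; it is not the matching algorithm the stages implement:

    from itertools import combinations

    # Hypothetical records; postal code serves as the blocking column.
    records = [
        {"id": 1, "postal": "10001", "name": "JOHN SMITH"},
        {"id": 2, "postal": "10001", "name": "JON SMITH"},
        {"id": 3, "postal": "94103", "name": "JOHN SMITH"},
    ]

    # Group records so that only those sharing a postal code are compared.
    blocks = {}
    for rec in records:
        blocks.setdefault(rec["postal"], []).append(rec)

    for block in blocks.values():
        for a, b in combinations(block, 2):
            # Hypothetical business rule: a shared surname within a block
            # makes the pair a candidate duplicate.
            if a["name"].split()[-1] == b["name"].split()[-1]:
                print("candidate match:", a["id"], b["id"])
    # candidate match: 1 2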

Consolidating and creating a survive record


The Survive stage lets you specify which columns and column values from a group of matched input records make up the surviving output record.
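For illustration, the following Python sketch applies one possible survive rule, keeping the longest non-empty value in each column as a simple "most complete" test; the rule and data are hypothetical, and real rules are defined in the stage:

    # A group of records that matching has identified as the same entity.
    group = [
        {"name": "J SMITH", "phone": "", "city": "NEW YORK"},
        {"name": "JOHN SMITH", "phone": "555-0100", "city": ""},
    ]

    def survive(records):
        # Hypothetical rule: for each column, keep the longest value.
        return {col: max((r[col] for r in records), key=len) for col in records[0]}

    print(survive(group))
    # {'name': 'JOHN SMITH', 'phone': '555-0100', 'city': 'NEW YORK'}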

Preparing source data


As you plan your project, you need to prepare the source data to realize the best results.
WebSphere QualityStage accepts all basic data types (non-vector, non-aggregate) other than binary. Non-basic
data types cannot be acted upon in WebSphere QualityStage except for vectors in the match stages. However,
non-basic data types can be passed through the WebSphere QualityStage stages.
Some columns need to be constructed with stages before using them in a WebSphere QualityStage stage. In
particular, create overlay column definitions, array columns, and concatenated columns as explicit columns
within the data before you use them.
For example, rather than declaring the first three characters of a five-character postal code column as a
separate additional column, you could use a Transformer stage to explicitly add the column to the source data
before using it in a WebSphere QualityStage stage.
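A Python sketch of that derivation, with hypothetical column names and data:

    # Derive the first three characters of the postal code as an explicit
    # column, as a Transformer stage derivation would.
    rows = [{"postal_code": "10001"}, {"postal_code": "94103"}]
    for row in rows:
        row["postal_area"] = row["postal_code"][:3]
    print(rows)
    # [{'postal_code': '10001', 'postal_area': '100'},
    #  {'postal_code': '94103', 'postal_area': '941'}]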
Note: Be sure to map missing values to null.
The actual data to be matched should conform to the following practices:

The codes used in columns should be the same for both data source and reference source.
For example, if the Gender column in the data source uses M and F as gender codes, the
corresponding column in the reference source should also use M and F (not, for
example, 1 and 0).

Whatever missing-value condition you use (for example, spaces or 99999) must be converted in
advance to the null character; a sketch of this conversion follows this list. This can be done
using the WebSphere DataStage Transformer stage.
If you are extracting data from a database, make sure that nulls are not converted to spaces.
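A minimal Python sketch of the missing-value conversion mentioned above; the sentinel values shown are hypothetical examples:

    # Hypothetical missing-value conditions agreed for this data.
    SENTINELS = {"", "99999", "UNKNOWN"}

    def to_null(value):
        # Return None (null) for any agreed missing-value condition.
        if value is None or value.strip() in SENTINELS:
            return None
        return value

    print([to_null(v) for v in ["10001", "99999", "   ", "UNKNOWN"]])
    # ['10001', None, None, None]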

Use the Standardize stage to standardize individual names or postal addresses. Complex conditions can be
handled by creating new columns before matching begins.
For example, a death indicator could be created by examining the disposition status of the patient. When
matching automobile crashes to hospital data, the E codes on the hospital record can be examined
to see whether a motor vehicle accident was involved; if so, a new column (MVA) is set to one, and all
other status information is set to zero. On the crash file, generate a column that is always one,
because all crashes are motor vehicle accidents. If both files report a motor vehicle accident, the
columns match (one to one); otherwise, the columns do not match (one to zero).
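A Python sketch of this indicator logic; the column names are hypothetical, and the E-code test is a simplification:

    def hospital_mva(record):
        # Simplified test: ICD-9 E codes E810-E819 denote motor vehicle
        # traffic accidents; a prefix check stands in for the real range test.
        return 1 if any(code.startswith("E81") for code in record["e_codes"]) else 0

    hospital = {"patient": "A1", "e_codes": ["E812"]}
    crash = {"case": "C7"}

    hospital["mva"] = hospital_mva(hospital)
    crash["mva"] = 1  # every crash record involves a motor vehicle accident
    print(hospital["mva"], crash["mva"])
    # 1 1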
