Quality Stage Wipro
QualityStage
Essentials
Objectives
Data Quality
Introduction to QualityStage
Developing with QualityStage
Investigate and Data Quality Assessment
Data Preparation
Standardize
Rule Set Overrides
Match
Survive
Data Migration Challenges
[Bar chart, in percent (axis 0 to 40): Data Quality; Business/Data Modeling; Business Rule Analysis; Managing Metadata]
Data Quality Increases ROI
NAME 1: Robert A. Jones TTE Robert Jones Jr.
ADDRESS 1: First Natl Provident
ADDRESS 2: FBO Elaine & Michael Lincoln UTA
ADDRESS 3: DTD 3-30-89 59 Via Hermosa
ADDRESS 4: c/o Colleen Mailer Esq
ADDRESS 5: Seattle, WA 98101-2345
The Anomalies Nightmare
87121 Pete & Soph Lalond 76 George Road Boston MASS FR Alert
Understanding the quality of your data and its impact on achieving success
Standardizing content, structure and meaning of data in preparation for matching and downstream processing
Identifying and linking duplicate entities or like entities
Selecting the “Best-of-breed” data for downstream processing
Do your data sources contain what you think they do?
Which source should you use for this project?
Does your data mean what you think it does?
How do you match records with the same meaning?
Can you correct and improve the quality of your data?
Can you make the data meaningful to users?
Are you able to keep data synchronized across the systems?
Can you deliver & update data in a timely manner?
Is your data sent to users based on events or content?
Why Investigate
Parsed fields shown: House, Pre-Dir., Street Name, Street Type, Suffix Dir., Unit Type, Unit Value, Floor Value, Box Value, Rte Value, Additional Address Info, NYSIIS
Input Address | Standardized Fields | NYSIIS
326 W 17 ST | 326 W 17TH ST | 17T
200 E.27TH STREET APT. 10H | 200 E 27TH ST APT 10H | 27T
168 FIRST AVE. | 168 1ST AVE | 1
35 PIERREPONT STREET APT.#3-B | 35 PIERREPONT ST APT 3B | PARAPAN
76-D LA BONNE VIE II DR | 76 D LA BONNE VIE II DR | LABANAVY
1560 BROADWAY SUITE 416 | 1560 BROADWAY STE 416 | BRADWY
50 FAIRVIEW DRIVE SOUTH | 50 FAIRVIEW DR S | FARV
247 DOVER GRN | 247 DOVER GRN | DAVAR
3530 HENRY HUDSON PKWY E APT 8D | 3530 HENRY HUDSON PKWY E APT 8D |
2951 W 33 ST APT 3C | 2951 W 33RD ST APT 3C | 33D
425 E 8TH ST (2ND FLOOR) | 425 E 8TH ST 2 | 8T
305 WEST 98TH ST APT #4AN | 305 W 98TH ST APT 4AN | 98T
37-06 100 STREET /FIRST FL | 37 06 100TH ST 1 | 100T
ONE FIFTH AVENUE | 1 5TH AVE | 5T
1 5TH AVE APT 15G | 1 5TH AVE APT 15G | 5T
P O BOX 2257 666 ANDERSON AVE | 666 ANDERSON AVE 2257 | ANDARSAN
Match
Type Wgt SSN Input Name Input Address Input City St Zip Title Sal. Maiden Name DOB
XA 29.73 092-52-1195 JEROME /LOFFREDO PO BOX 40206 BROOKLYN NY 11204-0206 SCUNNAIMINA 19640110
DA 29.73 092-52-1195 JEROME /LOFFREDO PO BOX 40206 BROOKLYN NY 11204-0206 19640100
XA 29.73 058-09-8019 HARRY W /BOGARDS PO BOX 845 PORT WASHINGTON NY 11050-0202 JR VAN GURP 19120920
DA 7.16 058-09-8019 HARRY /BOGAARDS P O BOX 845 PORT WASHINGT NY 110500202 0
XA 19.29 261-60-5676 ADRIAN /GARCIA ROCKEFELLER CENTER P O BOX 1062 NEW YORK NY 10020 19300908
DA 19.29 000-00-0000 ADRIAN /GARCIA P O BOX 1062 ROCKEFELLER CNTR NEW YORK NY 10185 0
XA 62.78 050-36-6598 GLORIA P /LEONNELL 1655 FLATBUSH AVE APT B302 BROOKLYN NY 11210-3271 19460410
DA 33.09 050-36-6598 GLORIA P /LEONNELL-WILLIAMS 1655 FLATBUSH AVE BROOKLYN NY 11210-3276 HILL 19460410
XA 62.78 053-52-8625 WILLIAM /LOCKLEY 10516 FLATLANDS 9TH ST BROOKLYN NY 11236-4624 BROWN 19571111
DA 62.78 053-52-8625 WILLIAM /LOCKLEY 10516 FLATLANDS 9TH ST BROOKLYN NY 11236-4624 BROWN 19571111
DA 44.08 000-00-0000 WILLIAM /LOCKLEY 105-16 FLATLANDS 9TH ST BROOKLYN NY 112364624 0
XA 54.42 414-76-9969 MARY /RICHARDSON 651 E 14TH ST NEW YORK NY 10009-3119 19451222
DA 24.73 414-76-9969 MARY P /RICHARDSON GRAY 651 E 14TH ST APT 10G NEW YORK NY 10009-3125 ROBINSON 19451222
What is Survive
Example 1:
Matched (First Name / Middle Name / Last Name)
MARI / (blank) / LEMELSON-LAPPNER
MARI / S / LEMELSON
Survived (First Name / Middle Name / Last Name)
MARI / S / LEMELSON-LAPPNER
Example 2:
The longest populated Middle Name, Date of Birth, and SSN
Matched (First Name / Middle Name / Last Name / DOB / SSN)
DENISE / (blank) / TRIANO / 19580211 / 98524173
DENISE / F / TRIANO / (blank) / (blank)
Survived (First Name / Middle Name / Last Name / DOB / SSN)
DENISE / F / TRIANO / 19580211 / 98524173
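The "longest populated" rule from Example 2 can be sketched in a few lines of Python. This is only an illustration of the idea, not QualityStage's implementation; the dictionary keys (`first`, `middle`, and so on) and the function name are hypothetical.

```python
def survive_longest(records, fields):
    """For each field, keep the longest non-blank value in the matched group."""
    survivor = {}
    for field in fields:
        values = [(rec.get(field) or "").strip() for rec in records]
        # max() with key=len returns the first of equally long values
        survivor[field] = max(values, key=len, default="")
    return survivor


# The matched group from Example 2 (field names are assumptions)
group = [
    {"first": "DENISE", "middle": "", "last": "TRIANO",
     "dob": "19580211", "ssn": "98524173"},
    {"first": "DENISE", "middle": "F", "last": "TRIANO",
     "dob": "", "ssn": ""},
]
best = survive_longest(group, ["first", "middle", "last", "dob", "ssn"])
print(best)
```

The survivor takes the middle initial from the second record and the DOB and SSN from the first, which mirrors the slide.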
Data Re-engineering Methods
Data Quality Assessment (DQA): analyze free form fields
I Investigation
Data Re-Engineering (DRE)
II Conditioning
III Matching
IV Survivorship
Exercise 1-1: Course Project
[Data flow diagram: sources AUTOHOME and LIFE. Steps shown: Select US Data for further processing; Investigate / Assess Data Quality; Add Unique Key; Append Data to a common format; Condition Name, Address and Area; Investigate Conditioned Results; Identify Duplicate Customer Records]
[Diagram: BUILD ONCE in the QualityStage Designer on Windows; RUN ANYWHERE over TCP/IP (FTP) on UNIX, Windows NT Server, and OS/390]
QualityStage Designer
Designer
Client GUI for designing
projects
Windows NT, 2000, XP
Enter meta data
Define Stages
Build Jobs
Standardization Rules
Designer Repository
Designer - Toolbar
CUT, COPY, PASTE Items listed on the right pane of the work area
Deployment modes
Batch
Real-time
Real-time via API
Master Projects Directory
Project information is deployed to the server
Project work files are stored on the server in
project libraries
Directory Structure
Designer
QualityStage Designer C:\Ascential\QualityStageDesigner70
Server
QualityStage Server C:\Ascential\QualityStageServer70
QualityStage
WAVES
Postal Certification Solutions
CASS
SERP
GeoLocator
Exercise 2-1:
Configure QualityStage
Configure the Designer for the development server
Run profile
Designer Options
Server – Master Projects directory
Designer Options
Starting the QualityStage Server
During the course
Development environment
Run Profile
QualityStage Application
Project Components
Stages
Jobs
Data File Definitions
Meta data
Stages
• Abbreviate
• Build
• CASS
• Collapse
• Convert
• Format
• Investigate
• Match
• Multinational Standardize
• Parse
• Program
• Select
• SERP
• Sort
• Standardize
• Survive
• Transfer
• Unijoin
• WAVES
• Z4changes
Designer
Import or enter file definitions and meta data defining your
sources and targets
Add stages defining the process or task
Deploy the job
Server
Run the job
Review results
Job Development Process
[Diagram: QualityStage Designer (Windows) deploys and runs the job script on the QualityStage Server]
Data Assessment
High-Level DFD
[Data flow diagram: sources AUTOHOME and LIFE. Steps shown: Select US Data for further processing; Reject NON US Data; Pre-Process US Data; Investigate / Assess Data Quality; Add Unique Key; Append Data to a common format; Condition Name, Address and Area; Investigate Conditioned Results; Identify Duplicate Customer Records]
Features
Analyze free-form and single domain fields
Provide frequency distributions of distinct values and patterns
Investigate methods
Character Discrete
Character Concatenate
Word
Investigate Methods
Method Why
FRQ
Count FRQ % Field Mask Sample / “Example”
00000908 45.309% bbbbbbbbbbbbbbbb [X] ||
00000020 2.009% bbbbnnnnbbbbbbbb [X] || 1904
00001096 54.691% nnnnnnnnbbbbbbbb [X] | 06011944
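A frequency report like the one above can be approximated with a short Python sketch. It maps each character of a fixed-width field to a mask symbol and counts the distinct masks. The symbols `b` (blank) and `n` (numeric) follow the report; using `a` for alphabetic characters, and the function names, are assumptions made for illustration.

```python
from collections import Counter


def char_mask(value):
    """Map each character to a mask symbol: b=blank, n=numeric, a=alpha."""
    symbols = []
    for ch in value:
        if ch == " ":
            symbols.append("b")
        elif ch.isdigit():
            symbols.append("n")
        elif ch.isalpha():
            symbols.append("a")
        else:
            symbols.append(ch)  # keep punctuation as-is
    return "".join(symbols)


def pattern_report(values, width):
    """Return (count, percent, mask) rows, most frequent mask first."""
    masks = Counter(char_mask(v.ljust(width)[:width]) for v in values)
    total = sum(masks.values())
    return [(count, round(100 * count / total, 3), mask)
            for mask, count in masks.most_common()]


# Hypothetical 16-character date field
sample = ["", "    1904", "06011944", "06011944"]
for count, percent, mask in pattern_report(sample, 16):
    print(f"{count:08d} {percent:7.3f}% {mask}")
```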
Pattern Reports
SEPLIST “¬” STRIPLIST “¬.”
Token1 Token2 Token3 Token4
120 Main St NW
SEPLIST “¬.” STRIPLIST “¬.”
Token1 Token2 Token3 Token4 Token5
120 Main St N W
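The effect of the two lists can be sketched as follows. This is a simplified model, not the product's parser: it assumes that “¬” stands for a blank, that SEPLIST characters end the current token, that STRIPLIST characters are discarded, and that the input was `120 Main St. N.W.` (the slide shows only the resulting tokens).

```python
def tokenize(text, seplist, striplist):
    """Split text on SEPLIST characters and drop STRIPLIST characters."""
    tokens, current = [], []
    for ch in text:
        if ch in seplist:
            if current:                      # a separator closes the token
                tokens.append("".join(current))
                current = []
            if ch not in striplist:          # unstripped separators survive
                tokens.append(ch)            # as tokens of their own
        elif ch in striplist:
            continue                         # stripped, but does not split
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


address = "120 Main St. N.W."                # assumed input
print(tokenize(address, seplist=" ", striplist=" ."))
print(tokenize(address, seplist=" .", striplist=" ."))
```

With the period only in the STRIPLIST, `N.W.` collapses into the single token `NW`; adding the period to the SEPLIST splits it into `N` and `W`.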
Data Typing: Classifying Tokens
PATTERN KEY:
^ – Numeric token
? – Unclassified alpha token
@, <, > – Mixed token
T – Street Type
U – Unit Type
Example: 120 Main Street Apt 6C gives the pattern ^ ? T U >
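A minimal sketch of token classification, assuming a tiny hand-made classification table (the entries in `CLASSIFICATION` are hypothetical; real rule sets ship much larger tables). Tokens found in the table get their class; everything else falls back to the default tags from the pattern key.

```python
# Hypothetical classification table: token -> class
CLASSIFICATION = {
    "STREET": "T", "ST": "T", "AVENUE": "T", "AVE": "T",
    "APT": "U", "APARTMENT": "U", "UNIT": "U",
}


def classify(token):
    """Return the class of one token, using default tags when unclassified."""
    word = token.upper()
    if word in CLASSIFICATION:
        return CLASSIFICATION[word]
    if word.isdigit():
        return "^"          # numeric
    if word.isalpha():
        return "?"          # unclassified alpha
    if word[0].isdigit():
        return ">"          # mixed, leading numeric (6C)
    if word[-1].isdigit():
        return "<"          # mixed, trailing numeric (A6)
    return "@"              # other mixed token


def pattern(text):
    """Build the pattern string for a whitespace-separated input."""
    return " ".join(classify(token) for token in text.split())


print(pattern("120 Main Street Apt 6C"))
print(pattern("21 WINGATE STREET APARTMENT 601"))
```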
Example: Word Investigate
ABBOTT ? ;0000000001
ABERCON ? ;0000000001
ABERCORN ? ;0000000007
ABERDEEN ? ;0000000001
Exercise 5-1:
Add a Record Key and Append Data Files
Lab 5-1:
Append LIFE to COMBINED Output
1. Create transfer stage
2. Define new record key field
3. Populate the record key field
4. Append LIFE to AUTOHOME in the COMBINED output file
Stage name: LFKEY
Stage type: Transfer
Job Name: Append
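Outside the tool, the idea behind this lab (give every record a unique key, then append both sources into one file) can be sketched as below. The key format, a source letter plus a sequence number, and the field names are assumptions for illustration only.

```python
def add_record_key(records, prefix):
    """Return copies of the records with a unique 'key' field added."""
    return [{"key": f"{prefix}{number:06d}", **record}
            for number, record in enumerate(records, start=1)]


# Hypothetical sample rows for the two source files
autohome = [{"name": "W HOLDEN"}, {"name": "MARY RICHARDSON"}]
life = [{"name": "W HOLDEN"}]

# Append LIFE to AUTOHOME in one combined list
combined = add_record_key(autohome, "A") + add_record_key(life, "L")
for row in combined:
    print(row["key"], row["name"])
```

Keeping the source in the key makes it possible to trace a combined record back to its legacy file after matching.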
Module Summary
Construct Application
Development
Standardize Data
Find Duplicate Candidate (Match)
Survive Best of Breed (Survive)
Unit Test
Data Quality Assessment (DQA): analyze free form fields
I Investigation
Data Re-Engineering (DRE)
II Standardize
III Match
IV Survive
Transformation
Parsing free form fields
Comparison threshold for classifying like words
Bucketing data tokens
Standardization
Applying standard values and standard formats
Phonetic Coding for use in Matching
NYSIIS
Soundex
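For reference, here is a sketch of the standard American Soundex code in Python (NYSIIS is more involved and is not shown). It is a textbook implementation, not code taken from QualityStage. A reverse Soundex is simply the Soundex of the reversed string.

```python
# Consonant groups of the standard Soundex algorithm
CODES = {letter: digit
         for digit, letters in {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                                "4": "L", "5": "MN", "6": "R"}.items()
         for letter in letters}


def soundex(word):
    """Return the 4-character Soundex code of a word ('' if no letters)."""
    letters = [ch for ch in word.upper() if ch.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue                 # H and W do not separate equal codes
        code = CODES.get(ch, "")     # vowels have no code
        if code and code != previous:
            result += code
        previous = code
    return (result + "000")[:4]      # pad or truncate to 4 characters


def reverse_soundex(word):
    """Soundex of the word read backwards."""
    return soundex(word[::-1])


print(soundex("SUMMER"), reverse_soundex("SUMMER"))
```

Phonetic codes like these let a match compare names that sound alike even when they are spelled differently.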
Standardize Example
Input File:
Address Line 1 | Address Line 2
1721 W ELFINDALE ST | UNIT 20
1721 W ELFINDALE ST # 20 |
16200 VENTURA BOULEVARD | SUITE 201
C/O JOSEPH C REIFF | 12 WESTERN AVE
1705 W St | PHILADELPHIA
1655 PONCE DE LEON AVENUE | 15TH FLOOR
Result File:
Columns: House #, Dir, Str. Name, Type, Unit Type, Unit Value, Floor Type, Floor Value, NYSIIS, Soundex, City
1721 W ELFINDALE AVE UNIT 20
1721 W ELFINDALE AVE 20
16200 VENTURA BLVD
12 WESTERN AVE
1705 W ST PHILADELPHIA
1655 PONCE DE LEON AVE FLOOR 15
Standardize Process
Parse: 21 WINGATE STREET APARTMENT 601
Classify & assign default tags: ^ ? T U ^
Standardize Stage
Uses Rule sets for:
Country processing
Pre-domain processing
– USPREP
Domain processing
– USADDR
– USAREA
– USNAME
Multi-national Address
WAVES
Types of Rule Sets
Country Identifier: COUNTRY
Domain Pre-processor: USPREP
Domain Specific: USNAME, USADDR, USAREA
Example: Country Identifier
Input Record
100 SUMMER STREET 15TH FLOOR BOSTON, MA 02111
SITE 6 COMP 10 RR 8 STN MAIN MILLARVILLE AB T0L 1K0
28 GROSVENOR STREET LONDON W1X 9FE
123 MAIN STREET
Output Record
US Y 100 SUMMER STREET 15TH FLOOR BOSTON, MA 02111
CA Y SITE 6 COMP 10 RR 8 STN MAIN MILLARVILLE AB T0L 1K0
GB Y 28 GROSVENOR STREET LONDON W1X 9FE
US N 123 MAIN STREET
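The output above can be imitated with a rough sketch that looks only at the postal code at the end of the record. This is not how the COUNTRY rule set works internally; the regular expressions, the function name, and the reading of the flag (Y when a country was recognized, N when the default was applied) are assumptions for illustration.

```python
import re

# Simplified postal-code shapes, checked at the end of the record
POSTAL_PATTERNS = [
    ("CA", re.compile(r"\b[A-Z]\d[A-Z] ?\d[A-Z]\d$")),
    ("GB", re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}$")),
    ("US", re.compile(r"\b\d{5}(-\d{4})?$")),
]


def identify_country(record, default="US"):
    """Return (country code, flag): 'Y' if recognized, 'N' if defaulted."""
    text = record.upper().strip()
    for code, postal in POSTAL_PATTERNS:
        if postal.search(text):
            return code, "Y"
    return default, "N"


for line in [
    "100 SUMMER STREET 15TH FLOOR BOSTON, MA 02111",
    "SITE 6 COMP 10 RR 8 STN MAIN MILLARVILLE AB T0L 1K0",
    "28 GROSVENOR STREET LONDON W1X 9FE",
    "123 MAIN STREET",
]:
    code, flag = identify_country(line)
    print(code, flag, line)
```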
Example: Domain Pre-Processor
Input Record
Field 1: JIM HARRIS (781) 322-2426
Field 2: 92 DEVIR STREET MALDEN MA 02148
Output Record
Name Domain: JIM HARRIS
Address Domain: 92 DEVIR STREET
Area Domain: MALDEN MA 02148
Other Domain: (781) 322-2426
Example: Domain-Specific
Input Record
100 SUMMER STREET 15TH FLOOR
Output Record
House Number: 100
Street Name: SUMMER
Street Suffix Type: ST
Floor Type: FL
Floor Value: 15
Address Type: S
NYSIIS of Street Name: SANAR
Reverse Soundex of Street Name: R520
Input Pattern: ^+T>U
Rule Sets
Dictionary File (.DCT) Define the output file fields to store the
parsed and conditioned data
Rule Set Description (.PRC) Description file for the Rule Set
Class Description
^ A single numeric
+ A single unclassified alpha (word)
? One or more consecutive unclassified alphas
@ Complex mixed token, e.g., ½, O’Connell
> Leading numeric, e.g., 6A
< Trailing numeric, e.g. A6
Zero Null class
User-defined Classes
Class Description
USNAME
G Generational, e.g., Senior, I, II
P Prefix, e.g. Dr., Mr., Miss
USADDR
T Street Type
D Directional
B Box Type
USAREA
S State Abbreviation
Comparison Threshold
May be used in the Classification table
Used to efficiently make entries into the classification table
Helps overcome spelling and data entry errors
Not required
Threshold level
900 Exact match
850 Almost certainly the same
800 Most likely equivalent
750 Most likely not the same
700 Almost certainly not the same
Standardizing International Data
Two methods
Method 1: Use country pre-processor, domain pre-processor,
and domain-specific rules
Uses out-of-the-box, included functionality/rules
CNTRYOUT
USDATA (Accept)
NONUSDATA (Reject)
Domain Pre-Processor Rule Sets
PREPOUT
Domain Rule Sets
Domain rule sets expect only data for that domain as the
input
Domain rule sets that come with QualityStage are:
Name
Street address
Area (city, state and zip)
STANOUT
Standardize Results
Business Intelligence fields
Parsed from the original data, they may be used in matching
and generally they are moved to the target system
Matching Fields
Generally these fields are created to help during the match
process and are dropped after successful matching
Reporting fields
Specifically created to help review the results of Standardize and
recognize handled and unhandled data
Business Intelligence Fields
Phonetic coding
NYSIIS
Reverse NYSIIS
Soundex
Reverse Soundex
Hash keys
First 2 characters of the first five words
Packed Keys
Data concatenated, or packed
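The two key types can be illustrated with a short sketch. The hash key follows the description above (first 2 characters of the first five words). For the packed key, "packed" is read here as "joined with the blanks removed", which is an assumption; both function names are hypothetical.

```python
def hash_key(text, words=5, chars=2):
    """First `chars` characters of each of the first `words` words."""
    return "".join(word[:chars] for word in text.split()[:words])


def packed_key(text):
    """All words concatenated with the blanks removed (assumed meaning)."""
    return "".join(text.split())


print(hash_key("LA BONNE VIE II DR"))
print(packed_key("LA BONNE VIE"))
```

Short keys like these are cheap to compare and tolerate differences that occur late in a long value.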
Standardize Reporting Fields
CLASSIFICATION TABLE
USER CLASSIFICATION
INPUT TEXT
INPUT PATTERN
UNHANDLED TEXT
UNHANDLED PATTERN
Rule Set Override Process
1. Enter override
2. Apply override
3. Test override with the Rules Analyzer
4. Repeat steps 1 through 3 for all desired overrides
Exercise 7-1:
Name Rule Set User Override
Review the unhandled NAME patterns in the
INUPNAMp.frq report
Apply NAME overrides
Test NAME overrides
Re-run the STAN Job to re-produce the new output file
with the overrides applied
Exercise 7-2:
Address and Area Overrides
Review the Investigation reports of unhandled Address
and Area data
Apply User Overrides to unhandled data
Test the Override
Re-run the STAN Job to re-produce the new output file
with the overrides applied
Module Summary
Which of the following record pairs is a match? And how do you know?
W HOLDEN 12 MAIN ST
W HOLDEN 12 MAINE ST
[Histogram: # of Pairs (0 to 3500) against Weight of Comparisons (-20 to 40), showing Non-Matches, Matches, and a grey area between them]
Thresholds defined can be used for automated processing
False Negatives Records below the low cutoff that really are
a match
Measuring the
Conditions of Uncertainty
Reliability of the data in a given field
Estimated as the probability that the field agrees given the
record pair is a match
Probability of a random agreement of values
Estimated as the probability the field agrees given the record
pair is not a match
Reliability (M-Probability)
M (m-prob) = .9
U (u-prob) = .01
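A common way to turn these two probabilities into field weights is the Fellegi-Sunter formulation from probabilistic record linkage: a positive weight when the field agrees and a negative weight when it disagrees. The sketch below assumes that formulation applies here; the function names are hypothetical.

```python
import math


def agreement_weight(m, u):
    """Weight added when the field agrees: log2(m / u)."""
    return math.log2(m / u)


def disagreement_weight(m, u):
    """Weight added (it is negative) when the field disagrees."""
    return math.log2((1 - m) / (1 - u))


m_prob, u_prob = 0.9, 0.01      # values from the slide
print(round(agreement_weight(m_prob, u_prob), 2))
print(round(disagreement_weight(m_prob, u_prob), 2))
```

Summing the weights of all compared fields gives the composite weight of a record pair, which is then compared against the cutoffs.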
Accuracy: The quality of the candidate records
Scope: The number of records
Blocking Strategy
Unduplication
Identifies duplicate candidates in one file
Match (Two File)
One-to-one correspondence
For every record on File A we expect to find a match to one
record on File B
Geomatch (Two File)
Many-to-one correspondence
More than one record on File A can match to the same record
on File B
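How blocking limits the work in an unduplication match can be sketched as follows: records are grouped by a blocking key, and only records within the same block become candidate pairs. The record layout, the ZIP values, and the choice of ZIP as the blocking field are made up for illustration.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records, blocking_key):
    """Yield record pairs that share the same blocking key value."""
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record)
    for block in blocks.values():
        yield from combinations(block, 2)   # compare only inside a block


# Hypothetical one-file input
records = [
    {"id": 1, "name": "W HOLDEN", "addr": "12 MAIN ST", "zip": "02111"},
    {"id": 2, "name": "W HOLDEN", "addr": "12 MAINE ST", "zip": "02111"},
    {"id": 3, "name": "M RICHARDSON", "addr": "651 E 14TH ST", "zip": "10009"},
]
pairs = [(a["id"], b["id"])
         for a, b in candidate_pairs(records, lambda r: r["zip"])]
print(pairs)
```

Three records would give three pairs without blocking; blocking on ZIP leaves a single candidate pair to be scored.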
Comparing Data Values
Different comparisons for different data
Over 24 comparison methods
Most common
CHAR - (character comparison) character by character, left to right.
UNCERT - (character uncertainty) tolerates phonetic errors,
transpositions, random insertion, deletion, and replacement of characters
CNT_DIFF – Counts keying errors in numeric fields. You set a tolerance
threshold
NAME_UNCERT – Can be used to compare character values. If the
strings are different lengths, the shorter of the two lengths is used
Exercise 8-1:
Undup Match
1. Define the output file
2. Define the Match Stage
3. Define the pass
• Choose blocking fields
4. Choose fields to compare and comparison method
5. Build the Match Extract
6. Create the Pass
Match Output Files
Survive the Best
Customer Record
Cross-Reference File
Group Legacy
1 D150
1 A1367
23 D689
23 A436
23 D352
Survive Rules
Pre-defined Techniques
Source
Recency
Frequency
Most complete (longest string)
User-specified logic
Target Fields
FIELD3: (SIZEOF (TRIM c.FIELD3) >= 5) AND (SIZEOF (TRIM c.FIELD1) > 0) ;
TARGET CONDITION
Exercise 9-1:
Survive the Best Customer Record
1. Define the output file
2. Define Survive stage
3. Choose target fields
4. Define Survive rules
5. Deploy and run
6. Review results
Module Summary