
Session 1:

QualityStage
Essentials
Objectives

 Data Quality
 Introduction to QualityStage
 Developing with QualityStage
 Investigate and Data Quality Assessment
 Data Preparation
 Standardize
 Rule Set Overrides
 Match
 Survive
Data Migration Challenges

[Bar chart: percent of project effort, by challenge]
 Data Quality
 Legacy Data Scrubbing
 Managing End-user Expectations
 Business/Data Modeling
 Managing Management Expectations
 Business Rule Analysis
 Managing Metadata
Data Quality Increases ROI

 Better decision making
 Improved marketing accuracy and scope
 Increased knowledge of customers
 Improved inventory and asset management
 Improved risk analysis, auditing and reporting
Data Quality

There are two significant definitions of data quality
 Inherent Data Quality
 Correctness or accuracy of data - The degree to which data accurately
reflects the real-world object that it represents

 Pragmatic Data Quality
 The value that accurate data has in supporting the work of the
enterprise
 Data that does not help enable the enterprise accomplish its mission has
no quality, no matter how accurate it is
Data Quality Challenges

 Different or inconsistent standards in structure, format or values
 Missing data, default values
 Spelling errors, data in wrong fields
 Buried information
 Data myopia
 Data anomalies
Different or Inconsistent Standards

           Name                   Field Location

Source 1   MARK DI LORENZO        MA93
           DENIS E. MARIO         CT15
           TOM & MARY ROBERTS     IL21

Source 2   DILORENZO, MARK        6793
           MARIO, DENISE          0215
           ROBERTS, TOM & MARY    8721

Source 3   MARC DILORENZO ESQ     BOSTON
           MRS DENNIS MARIO       HARTFORD
           MR & MRS T. ROBERTS    CHICAGO
Missing Data & Default Values

Do the field values match the meta data labels?

NAME SOC. SEC. # TELEPHONE

Denise Mario DBA 228-02-1975 6173380300
Marc Di Lorenzo ETAL 999999999 3380321
Tom & Mary Roberts 025-37-1888
First Natl Provident 34-2671434 415-392-2000
101010101 508-466-1200
Astorial Fedrl Savings LN#12-756 212-235-1000
Kevin Cooke, Receiver 18-7534216 FAX 528-9825
John Doe Trustee for K 111111111 5436
Buried Information

Legacy Meta Desc.   Legacy Record Values

NAME 1              Robert A. Jones TTE Robert Jones Jr.
ADDRESS 1           First Natl Provident
ADDRESS 2           FBO Elaine & Michael Lincoln UTA
ADDRESS 3           DTD 3-30-89 59 Via Hermosa
ADDRESS 4           c/o Colleen Mailer Esq
ADDRESS 5           Seattle, WA 98101-2345
The Anomalies Nightmare

CUSNUM NAME ADDRESS SALES $

90328574 IBM 187 N.Pk. Str. Salem NH 01456 8,494.00
90328575 I.B.M. Inc. 187 N.Pk. St. Sarem NH 01456 3,432.00
90238495 International Bus. M. 187 No. Park StSalem NH 04156 2,243.00
90233479 Int. Bus. Machines 187 Park Ave Salem NH 04156 5,900.00
90233489 Inter-Nation Consults 15 Main St. Andover MA 02341 6,800.00
90234889 Int. Bus. Consultants PO Box 9 Boston MA 02210 10,243.00
90345672 I.B. Manufacturing Park Blvd. Boston MA 04106 15,999.00

No common key   Lack of standards   Anomalies   Spelling errors
What data challenges do you face?
 Data entry errors and misspellings
 No consistent naming convention
 Missing values or data in the wrong fields
 Business terms and spillover text
 Buried information
Acct # Name Address City State Zip Note

5154155 Peter J. Lalonde 40 Beacon St. Melrose, Mass 02176 ODP

5152335 LaLonde, Peter 76 George 617-210-0824 Boston YES MA 02111

5146261 Lalonde, Sofie 40 Bacon Street Melrose MA CHK ID

87121 Pete & Soph Lalond 76 George Road Boston MASS FR Alert

87458 P. Lalonde FBO S.Lalonde40 Becon Rd. Melrose MA 02 176

No unique key linking records together
Common Data Quality Approaches
 Analysis and Assessment
 Enterprise-level: Data Quality Assessment (DQA)
 Project-level: Data investigation
 Data Re-engineering Methods
 Standardization
 Record linkage/matching
 Consolidation
 Information Engineering Methods
 Initial load
 Net change
 Real-time
 Ongoing Metrics
 Project-level: Post-Data Quality Assessment (PDQA)
 Enterprise-level: Repeated DQA’s to establish trends
Data Re-engineering Methodology

Discover      Standardize   Linkage    Consolidate
Investigate   Condition     Matching   Survivorship

Investigate: Understanding the quality of your data and its impact on
achieving success
Condition: Standardizing content, structure and meaning of data in
preparation for matching and downstream processing
Matching: Identifying and linking duplicate entities or like entities
Survivorship: Selecting the “Best-of-breed” data for downstream processing

Key questions at each step:
 Do your data sources contain what you think they do?
 Which source should you use for this project?
 Does your data mean what you think it does?
 How do you match records with the same meaning?
 Can you correct and improve the quality of your data?
 Can you make the data meaningful to users?
 Are you able to keep data synchronized across the systems?
 Can you deliver & update the data in a timely manner?
 Is your data sent to users based on events or content?
Why Investigate

 Discover trends and potential anomalies in the data
 100% visibility of single domain and free-form fields
 Identify invalid and default values
 Reveal undocumented business rules and common
terminology
 Verify the reliability of the data in the fields to be used
as matching criteria
 Gain complete understanding of data within context
How to Investigate
• Single domain (character and type)
• Freeform text (Word)

Name – ‘Word’ Pattern Legend

Class  Description              Class  Description
?      Unknown                  W      Organization Words
N      Salutation               A      Abbreviations
<      Mixed (Leading alpha)    ^      Numeric
>      Mixed (Leading Numeric)  @      Complex
I      Initials                 F      First Name
other  The character itself. For example: (, ), *, %

Summary:
 100% populated
 The top four patterns account for 81% of the populated values
 Contains ‘slash’ that may indicate last name
 Contains full and partial first name

Frequency   Percent   Pattern   Data Sample
3,533,119   39.332%   FI?       ADINA A /ACEVERO
2,590,837   28.842%   F?        CARMEN /ABRAHANTE
614,006     6.835%    ??        ISHAI /BIRAN
552,579     6.152%    FIF       SCOTT J /ALBERT
331,279     3.688%    FF        JENNIFER /ASUNCION
314,199     3.498%    ?I?       CLAUDVILLE P /ADAMS
154,026     1.715%    FF?       ELIZABETH ANN /ABELA
What is Standardize

“The revealed patterns drive the conditioning rules.”

 Pattern Manipulation:
 Applying business logic to data chaos.
 Standards Definition:
 Enforcing business standards on data elements.
 Field Structuring:
 Transforming the input to an output which meets the business
requirement.
How to standardize
 Parsing specific data fields into smaller, lower-level
(atomic) data elements
 Categorization of identified elements
 Separation of Name, Address, and Area from freeform Name &
Address lines
 Identification of Distinct Material Categories (e.g. Sutures vs.
Orthopedic Equipment)
 Refinement of a data element
 Name = ‘MS GRACY E MATHEWS’ becomes:
Title = ‘MS’
First Name = ‘GRACY’
Middle Name = ‘E’
Last Name = ‘MATHEWS’
 Part Description = ‘BLK ACER MONITOR ’ becomes:
Color = ‘BLACK’
Type = ‘ACER ’
Part = ‘MONITOR’
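The name refinement above can be sketched in Python. This is a minimal illustration of the idea only, not the QualityStage rule engine, and the TITLES classification table is an assumed sample:

```python
TITLES = {"MR", "MRS", "MS", "DR"}  # illustrative classification table

def refine_name(name: str) -> dict:
    # Parse a freeform personal name into atomic elements, assuming a
    # simple Title / First / Middle / Last layout.
    tokens = name.split()
    parts = {"Title": "", "First": "", "Middle": "", "Last": ""}
    if tokens and tokens[0] in TITLES:
        parts["Title"] = tokens.pop(0)
    if tokens:
        parts["Last"] = tokens.pop()       # final token is the surname
    if tokens:
        parts["First"] = tokens.pop(0)     # first remaining token
    parts["Middle"] = " ".join(tokens)     # anything left in between
    return parts

# refine_name("MS GRACY E MATHEWS")
#   -> {"Title": "MS", "First": "GRACY", "Middle": "E", "Last": "MATHEWS"}
```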
Why Standardize

 Normalize values in data fields to standard values
 Transform First Name = ‘MIKE’  ‘MICHAEL’
 Transform Title = ‘Doctor’  ‘Dr’
 Transform Address = ‘ST. Michael Street’  ‘Saint Michael
St.’
 Transform Color = ‘BLK’  ‘BLACK’
 Applying phonetic coding to key words
 NYSIIS
 Soundex
 Typically applied to Name fields (first, last, street, city)
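For illustration, classic American Soundex (one of the phonetic codes named above) can be sketched as follows; this is a textbook version of the algorithm, not QualityStage's implementation:

```python
def soundex(name: str) -> str:
    # Classic American Soundex: keep the first letter, code the rest.
    codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4",
             **dict.fromkeys("MN", "5"), "R": "6"}
    name = "".join(c for c in name.upper() if c.isalpha())
    if not name:
        return ""
    result, prev = name[0], codes.get(name[0], "")
    for c in name[1:]:
        code = codes.get(c, "")
        if code and code != prev:      # skip repeats of the same code
            result += code
        if c not in "HW":              # H and W do not separate codes
            prev = code
    return (result + "000")[:4]        # pad/truncate to letter + 3 digits

# soundex("Robert") and soundex("Rupert") both yield "R163"
```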
QualityStage Standardize

 Highly flexible pattern recognition language
 Field or domain specific standardization (i.e. unique
rules for names vs. addresses vs. dates, etc.)
 Customizable classification and standardization tables
 Utilizes results from data investigation
QualityStage Standardize Example
Name Standardization Example

Input Name   “Bucketed” Name Information after INTEGRITY Standardization
Name Type Gdr Prefix First Middle Last Suffix Gen NYSIIS Match First Additional Name
CHESTER FINANCIAL /INC O CHESTER FINANCIAL INC CASAR
MIGUEL A /DEJESUS-VAZQUEZ I M MIGUEL A DEJESUS-VAZQUEZ DAJAS MIGUEL
DEBBIE KOTIN /INSDORF ESQ. I F DEBBIE KOTIN INSDORF ESQ INSDARF DEBORAH
DURAND,RAYMOND J. I M RAYMOND J DURAND DARAN RAYMOND
JOHN FRANCIS /ECKSTEIN IV I M JOHN FRANCIS ECKSTEIN IV ECSAN JOHN
BOB T /HSIEH I M BOB T HSIEH HSAH ROBERT
MOST REV. VINCENT D. /BREEN I M MOST REV VINCENT D BREEN BRAN VINCENT
MISS DOROTHY /MEAGHER I F MISS DOROTHY MEAGHER MAGAR DOROTHY
MINISTER ROXANN D /ROBINSON I F MINISTER ROXANN D ROBINSON RABANSAN ROXANN
LUCIEN D /MOCOMBE MD I M LUCIEN D MOCOMBE MD MACANB LUCIEN
FRANK /MCCORD III I M FRANK MCCORD III MCAD FRANK
JOHN L /HANCOCK 111 I M JOHN L HANCOCK III HANCAC JOHN
ONLINE BANKING /TEST1015 ONLINE BANKING TEST1015
ON LINE BANKING /TEST #1120 ON LINE BANKING TEST 1120
ITA5 /TEST ITA5 TEST
RABBI JEROME M /BLUM I M RABBI JEROME M BLUM BLAN JEROME
FRANCES /WILLIAMS JONES I M FRANCES WILLIAMS JONES JAN FRANCES
QualityStage Standardize Example
Address Standardization Examples

Input Address   “Bucketed” Address Information after INTEGRITY Standardization

Bucket columns include: House, Pre-Dir., Street Name, Street Type,
Suffix Dir., Unit Type, Unit Value, Floor Value, Box Value, Rte Value,
Additional Info, Address Suffix, Address NYSIIS
326 W 17 ST 326 W 17TH ST 17T
200 E.27TH STREET APT. 10H 200 E 27TH ST APT 10H 27T
168 FIRST AVE. 168 1ST AVE 1
35 PIERREPONT STREET APT.#3-B 35 PIERREPONT ST APT 3B PARAPAN
76-D LA BONNE VIE II DR 76 D LA BONNE VIE II DR LABANAVY
1560 BROADWAY SUITE 416 1560 BROADWAY STE 416 BRADWY
50 FAIRVIEW DRIVE SOUTH 50 FAIRVIEW DR S FARV
247 DOVER GRN 247 DOVER GRN DAVAR
3530 HENRY HUDSON PKWY E APT 8D 3530 HENRY HUDSON PKWY E APT 8D
2951 W 33 ST APT 3C 2951 W 33RD ST APT 3C 33D
425 E 8TH ST (2ND FLOOR) 425 E 8TH ST 2 8T
305 WEST 98TH ST APT #4AN 305 W 98TH ST APT 4AN 98T
37-06 100 STREET /FIRST FL 37 06 100TH ST 1 100T
ONE FIFTH AVENUE 1 5TH AVE 5T
1 5TH AVE APT 15G 1 5TH AVE APT 15G 5T
P O BOX 2257 666 ANDERSON AVE 666 ANDERSON AVE 2257 ANDARSAN
Match

“Conditioned data and QualityStage’s matching engine link the previously
unlinkable.”

 Match Construction:
 Reliability of input data defines a match result.
 Statistical Analysis & Match Scoring:
 Linkage probability determined on a sliding scale by field level comparison.
 Report Generation:
 All business rules applied have easy to understand report structure.
What is Match

 Identifying all records on one file that correspond to similar records
on another file
 Identifying duplicate records in one file
 Building relationships between records in multiple files
 Performing statistical and probabilistic matching
 Calculating a score based on the probability of a match
How to Match
 Single file (Unduplication) or two file (Geomatch)
 Different match comparisons for different types of data
(e.g. exact character, uncertainty/fuzzy match, keystroke
errors, multiple word comparison …)
 Generation of composite weights from multiple fields
 Use of probabilistic or statistical algorithms
 Application of match cutoffs or thresholds to identify
automatic and clerical match levels
 Incorporation of override weights to assess particular
data conditions (e.g. default values, discriminatory
elements)
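The composite weight built from field-level comparisons can be sketched as below. The field names and m/u probabilities are made-up illustrations; real match parameters come from the data and from QualityStage's estimation, not from this sketch:

```python
import math

def field_weight(m: float, u: float, agrees: bool) -> float:
    # Fellegi-Sunter style weights: agreement adds log2(m/u),
    # disagreement adds log2((1-m)/(1-u)), a negative penalty.
    return math.log2(m / u) if agrees else math.log2((1 - m) / (1 - u))

def composite_weight(rec_a: dict, rec_b: dict, params: dict) -> float:
    # params maps field -> (m, u): the chance the field agrees on a
    # true match vs. on a random non-match.
    return sum(field_weight(m, u, rec_a[f] == rec_b[f])
               for f, (m, u) in params.items())

params = {"last_name": (0.95, 0.01), "zip": (0.90, 0.05)}
a = {"last_name": "LOFFREDO", "zip": "11204"}
b = {"last_name": "LOFFREDO", "zip": "11204"}
c = {"last_name": "LOFFREDO", "zip": "10020"}
# composite_weight(a, b, params) exceeds composite_weight(a, c, params);
# cutoffs on this score separate automatic, clerical, and non-matches.
```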
QualityStage Match
 Over 25 match comparison algorithms providing a full
spectrum of fuzzy matching functions
 Statistically-based method for determining matches
(Probabilistic Record Linkage Theory)
 Field-by-field comparisons for agreement or
disagreement
 Assignment of weights or penalties
 Overrides for unique data conditions
 Score results to determine the probability of matched
records
 Thresholds for final match determination
 Ability to measure informational content of data
QualityStage Match Examples

Type Wgt SSN Input Name Input Address Input City St Zip Title Sal. Maiden Name DOB

XA 29.73 092-52-1195 JEROME /LOFFREDO PO BOX 40206 BROOKLYN NY 11204-0206 SCUNNAIMINA 19640110
DA 29.73 092-52-1195 JEROME /LOFFREDO PO BOX 40206 BROOKLYN NY 11204-0206 19640100

XA 29.73 999-99-9999 ANGEL A /MOUMDJIAN PO BOX 16 ALPINE NJ 07620-0016 0 19250101
DA 29.73 000-00-0000 ANGEL /MOUMDJIAN PO BOX 16 ALPINE NJ 07620-0016 X 19250101

XA 29.73 058-09-8019 HARRY W /BOGARDS PO BOX 845 PORT WASHINGTON NY 11050-0202 JR VAN GURP 19120920
DA 7.16 058-09-8019 HARRY /BOGAARDS P O BOX 845 PORT WASHINGT NY 110500202 0

XA 19.29 261-60-5676 ADRIAN /GARCIA ROCKEFELLER CENTER P O BOX 1062 NEW YORK NY 10020 19300908
DA 19.29 000-00-0000 ADRIAN /GARCIA P O BOX 1062 ROCKEFELLER CNTR NEW YORK NY 10185 0

XA 62.78 050-36-6598 GLORIA P /LEONNELL 1655 FLATBUSH AVE APT B302 BROOKLYN NY 11210-3271 19460410
DA 33.09 050-36-6598 GLORIA P /LEONNELL-WILLIAMS
1655 FLATBUSH AVE BROOKLYN NY 11210-3276 HILL 19460410

XA 62.78 053-52-8625 WILLIAM /LOCKLEY 10516 FLATLANDS 9TH ST BROOKLYN NY 11236-4624 BROWN 19571111
DA 62.78 053-52-8625 WILLIAM /LOCKLEY 10516 FLATLANDS 9TH ST BROOKLYN NY 11236-4624 BROWN 19571111
DA 44.08 000-00-0000 WILLIAM /LOCKLEY 105-16 FLATLANDS 9TH ST BROOKLYN NY 112364624 0

XA 54.42 414-76-9969 MARY /RICHARDSON 651 E 14TH ST NEW YORK NY 10009-3119 19451222
DA 24.73 414-76-9969 MARY P /RICHARDSON GRAY
651 E 14TH ST APT 10G NEW YORK NY 10009-3125 ROBINSON 19451222
What is Survive

 Creation of best-of-breed “surviving” data based on record or field
level information
 Development of cross-reference file of related keys
 Production of load exception reports
 Creating output formats:
 Relational table with primary and foreign keys
 Transactions to update databases
 Cross-reference files, synonym tables
Why survive

 Provide consolidated view of data
 Provide consolidated view containing the “best-of-breed” data
 Resolve conflicting values and fill missing values
 Cross-populate best available data
 Implement business and mapping rules
 Create cross-reference keys
How to survive

 Highly flexible rules
 Record or field level survivorship decisions
 Rules can be based upon data frequency, data recency
(i.e. date), data source, value presence or length
 Rules can incorporate multiple tests
 QualityStage features
 Point-and-click (GUI-based) creation of business rules to
determine best-of-breed “surviving” data
 Performed at record or field level
QualityStage Survive Examples
Example 1:
The longest populated Middle and Last Name

Matched                                 Survived
First  Middle  Last Name                First  Middle  Last Name
MARI           LEMELSON-LAPPNER         MARI   S       LEMELSON-LAPPNER
MARI   S       LEMELSON

Example 2:
The longest populated Middle Name, Date of Birth, and SSN

Matched                                          Survived
First   Middle  Last    DOB       SSN            First   Middle  Last    DOB       SSN
DENISE          TRIANO  19580211  98524173       DENISE  F       TRIANO  19580211  98524173
DENISE  F       TRIANO
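A field-level survivorship rule such as "longest populated value wins" can be sketched as follows. This is an illustration of the idea, not the QualityStage rule syntax:

```python
def survive(records: list, rules: dict) -> dict:
    # rules maps field name -> scoring function; for each field, the
    # record with the highest score contributes the surviving value.
    survived = {}
    for field, score in rules.items():
        best = max(records, key=lambda r: score(r.get(field, "")))
        survived[field] = best.get(field, "")
    return survived

# "Longest populated" rule for each name field.
rules = {f: (lambda v: len(v or "")) for f in ("First", "Middle", "Last")}
matched = [
    {"First": "MARI", "Middle": "",  "Last": "LEMELSON-LAPPNER"},
    {"First": "MARI", "Middle": "S", "Last": "LEMELSON"},
]
# survive(matched, rules)
#   -> {"First": "MARI", "Middle": "S", "Last": "LEMELSON-LAPPNER"}
```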
Data Re-engineering Methods

Data Quality           Analyze free       Data Re-Engineering
Assessment (DQA)       form fields        (DRE)

I               II              III             IV
Investigation   Conditioning    Matching        Survivorship
Exercise 1-1: Course Project

 Course business case: WINN Insurance CRM project


 See QualityStage Essentials Exercises, page 4
Course Project Design

For each source file (AUTOHOME and LIFE):
 Select US Data for further processing
 Standardize Country
 Add Unique Key
 Append Data to a common format
 Investigate / Assess Data Quality
 Condition Name, Address and Area
 Apply User Overrides
 Investigate Conditioned Results
 Identify Duplicate Customer Records
 Survive the Best Customer Record
Module Summary

 Five common data quality contaminants
1. Different standards
2. Missing and default values
3. Spillover and buried information
4. Anomalies
5. No consolidated view
 Approaches to Data Quality
 Data re-engineering methods
Introduction to QualityStage
Why QualityStage
 Probabilistic record linkage results in highest level
of accurate, complete and justifiable match rates
 Most flexible parsing/standardization capabilities –
handles complex free-form data
 Ability to verify 200+ country addresses allows for
global support
 Transparent parallelism exploits multiple CPUs
which provides unmatched performance and
scalability
 Bi-directional meta data exchange ensures users
understand data
 Productivity, connectivity and interoperability via
tight integration with DataStage and RTI Services
QualityStage Architecture

QualityStage Designer (Windows) connects via TCP/IP (FTP) to the
QualityStage Server platforms: UNIX, Windows NT Server, and OS/390.

BUILD ONCE, RUN ANYWHERE
QualityStage Designer

 Designer
 Client GUI for designing
projects
 Windows NT, 2000, XP
 Enter meta data
 Define Stages
 Build Jobs
 Standardization Rules

 Designer Repository
Designer - Toolbar

NEW – Project, Data File definition, Data Field definition, Stage, or Job
CUT, COPY, PASTE – Items listed on the right pane of the work area
DISPLAY – Change display of right pane to Large icons, Small icons or
show Details
RUN – The job selected on the right pane
Designer - Rule Sets
 Pre-defined rules for parsing and
standardizing:
 Name
 Address
 Area (City, State and Zip)
 Multi-national address processing
 Validate structure:
 Tax ID
 US Phone
 Date
 Email
 Append ISO country codes
 Pre-process or filter name, address
and area
 Rule sets are stored locally with the
Designer (separate from the
repository)
Designer Rule Set Options

 The name and location is defined in the Designer – File, Designer
Options, Standardize Process Definition Dictionary
QualityStage Server

 Deployment modes
 Batch
 Real-time
 Real-time via API
 Master Projects Directory
 Project information is deployed to the server
 Project work files are stored on the server in
project libraries
Directory Structure

Designer
QualityStage Designer C:\Ascential\QualityStageDesigner70

Designer Repository C:\Ascential\QualityStageDesigner70\QualityStageDesigner.mdb

Rule Sets C:\Ascential\QualityStageDesigner70

Server
QualityStage Server C:\Ascential\QualityStageServer70

Master Projects Directory C:\Projects

Sample Project Directory C:\Projects\Quality

Sample Project Results C:\Projects\Quality\Data
Master Projects Directory

 Master projects directory resides on the server
 Multiple users can share the same Master Projects and Project directory
 All project libraries are stored under the Master
Projects directory
Project Libraries
 Project libraries are stored under the Master Projects
directory

Project Library Description


Ipe_env.sh QualityStage Environment shell
Controls Stage and job control members
Data Location of input and output files
DIC Stage and job dictionary

IPICFG Environment configuration

Logs Location of job run logs

Scripts Job scripts – dependent on the server type
Temp Temp work space
QualityStage Licensed Stages

 QualityStage
 WAVES
 Postal Certification Solutions
 CASS
 SERP

 GeoLocator
Exercise 2-1:
Configure QualityStage
 Configure the Designer for the development server
 Run profile
 Designer Options
 Server – Master Projects directory
 Designer Options
 Starting the QualityStage Server
 During the course
 Development environment
Run Profile

 One or multiple profiles
 Defines for the Designer
the server component
location and access
 Required:
– Host Type
– Host Server Path
– Master Project Directory
 Optional:
– Alternate Locale
– Local Report Data Location
Run Profile: Adv Project Settings

 Location of the input and output data files
 Location of the control members for each stage and job
 Server temporary work location
 Logs for each stage and job
 Scripts to execute jobs
Run Profile: FTP Settings

If you are connecting to a remote server then you need the login ID and
password for the server.
QualityStage Designer Options

 Local working temp directory on your local PC
 Location of the rule sets
 Default location for importing projects
 Preferred editor for reviewing rule sets and result file
Module Summary
 QualityStage
 Components
 Architecture
 Communication: Designer and Server use TCP/IP (FTP) to communicate
 Configuration
 User Profile
 Designer Options
 Starting the Server
 Projects
 Projects are defined in the Designer
 To run and execute jobs, QualityStage Server must be running
 Project libraries are stored on the server
Developing with QualityStage
Module Objectives

 Introduce the concepts, components and methods for developing projects
in QualityStage
 After this module you will be able to:
 Define data files and field definitions
 Build Stages and design Jobs
 Deploy and run Jobs
 Locate and review results
Application Components

 QualityStage Application
 Project Components
 Stages
 Jobs
 Data File Definitions
 Meta data

 File Name Requirements
Stages

 Abbreviate
 Build
 CASS
 Collapse
 Convert
 Format
 Investigate
 Match
 Multinational Standardize
 Parse
 Program
 Select
 SERP
 Sort
 Standardize
 Survive
 Transfer
 Unijoin
 WAVES
 Z4changes

** Licensed Stages – additional licensing required
What is a Job?

 A job is an executable QualityStage program
 Jobs can be run interactively or in batch mode
 In this course, jobs will be run interactively under the
control of QualityStage Designer
Job Development Overview

 Designer
 Import or enter file definitions and meta data defining your
sources and targets
 Add stages defining the process or task
 Deploy the job
 Server
 Run the job
 Review results
Job Development Process

1. Define data files
 Enter or import meta data
2. Define and build stages
3. Define job
4. Deploy the job
 Move project definitions to project libraries on the server
5. Run the job
6. Review results
Executing a Job: Deploy and Run

QualityStage Designer (Windows) deploys the job script to the
QualityStage Server and runs it:

 Deploy: Transfer project information to the server
 RUN: Execute the job script on the server
QualityStage Job Run Modes

FILE MODE – Process each record through a job before passing all the
records to the next job

DATA STREAM – Process each record and then pass it immediately on to
the next job
Exercise 3-1:
Deploy and Run
1. Open the demo project Quality
2. Select a job
3. Select the Run button on the toolbar
4. Uncheck the Deploy box
5. Choose “Execute File Mode”
6. Choose “Run from Start to End”
7. Review project libraries on the server
Data File Formats and Definitions

 Data File Names
 One to eight characters
 No spaces or extensions
 File names are uppercase and case-sensitive
 Data File Location
 Data folder in project library
 Formats
 QualityStage processes fixed record length sequential files
 Alphanumeric characters
Exercise 3-2:
Define a Project
1. Choose the New icon from the Tool Bar
2. Choose Project
3. Project Name: WinnCRM
4. Project Description: Winn Insurance CRM Project
5. Choose OK
Defining Meta Data

 Data field definitions can be entered or imported into the Designer
 Importing options include:
 Cobol copybooks
 ODBC enabled
 MetaStage
 Metabroker
 Visual Warehouse
Exercise 3-3:
Define a Data File
1. Left pane, select Data File Definitions
2. Right pane, right-click, select New File
3. Filename AUTOHOME
4. File: Auto and Home Policies
5. Choose OK
Exercise 3-4:
Data Field Definitions

1. Left pane, select Data File Definitions
2. Left pane, select AUTOHOME
3. Right pane, right-click, select New Field
4. Complete field information
Lab 3-5:
Copy Data File and Field Definitions
1. Left pane, select Data File Definitions
2. Right pane, select AUTOHOME
3. Right-click, select COPY
4. Left pane, select Data File Definitions
5. Right pane, right-click, select PASTE
6. Name File: LIFE
7. Choose OK
Module Summary

 Data file definitions
 Data file format
 Meta data
 Jobs and Stages
 Run and Deploy
 Project Libraries
Investigate and Data Quality Assessment
Module Objectives

 Describe how the Investigate stage is used to assess data quality in
the project life cycle
 Identify the three types of Investigate stage
 Character Discrete Investigate
 Character Concatenate Investigate
 Word Investigate
 Design Investigate stages and run Investigate jobs
 Review and analyze Investigate results
Project Planning & Requirements

Planning Identify Objectives

Data Assessment

Requirements Define Business Requirements

Define Data Requirements

Application Design Plan

Define Development Plan
Data Re-engineering Methods

Data Quality           Analyze free       Data Re-Engineering
Assessment (DQA)       form fields        (DRE)

I               II              III             IV
Investigation   Conditioning    Matching        Survivorship
High-Level DFD

AUTOHOME and LIFE:
 Select US Data for further processing (Reject NON US Data)
 Pre-Process US Data
 Investigate / Assess Data Quality
 Standardize Country
 Add Unique Key
 Append Data to a common format
 Condition Name, Address and Area
 Apply User Overrides
 Investigate Conditioned Results
 Identify Duplicate Customer Records
 Survive the Best Customer Record
Data Assessment
 Verify the domain
 Review each field and verify the data matches the meta data
 Identify data formats, missing and default values
 Identify data anomalies
 Format
 Structure
 Content
 Discover “unwritten” business rules
 Identify data preparation requirements
Investigate Stage

 Features
 Analyze free-form and single domain fields
 Provide frequency distributions of distinct values and patterns
 Investigate methods
 Character Discrete
 Character Concatenate
 Word
Investigate Methods

Method                  Why

Character Discrete      Analyzing field values, formats, and domains

Character Concatenate   Cross-field correlation, checking logic
                        relationships between fields

Word                    Identifying free-form fields that may require
                        parsing and discovery of key words for
                        classification
Investigate Terminology

Field Masks   Options that represent the data. Options:
              Character (C), Type (T), Skipped (X)

Tokens        Individual units of data

Character Mask   Usage
C                For viewing the actual character values of the data
T                For viewing the pattern of the data
X                For ignoring characters

Field Mask Examples

Token            Mask            Result
02116            CCCCC           02116
02116            CCCXX           021
01832-4480       TTTTTTTTTT      nnnnn-nnnn
XJ2 6EM          TTTTTTT         aanbnaa
(617) 338-0300   CCCCCCCCCCCCCC  (617) 338-0300
617-338-0300     TTTTTTTTTTTT    nnn-nnn-nnnn
6173380300       CCCXXXXXXXXX    617
(617)3380300     CCCXXXXXXXXX    (61
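The behavior of the C/T/X masks in the table can be sketched as below. The type letters (n, a, b) are taken from the examples; this is an illustration, not the actual Investigate implementation:

```python
def apply_mask(token: str, mask: str) -> str:
    # C keeps the character, T replaces it with its type
    # (n = digit, a = letter, b = blank), X drops it.
    out = []
    for ch, m in zip(token, mask):
        if m == "C":
            out.append(ch)
        elif m == "T":
            if ch.isdigit():
                out.append("n")
            elif ch.isalpha():
                out.append("a")
            elif ch == " ":
                out.append("b")
            else:
                out.append(ch)     # punctuation shows through, e.g. '-'
        # m == "X": character is ignored
    return "".join(out)

# apply_mask("01832-4480", "TTTTTTTTTT") -> "nnnnn-nnnn"
# apply_mask("6173380300", "CCCXXXXXXXXX") -> "617"
```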
Character Discrete: Field Mask (C)haracter

 Usage: Domain quality
 View the contents of each field to verify that the data values match
the field labels
 Investigate Stage:
 Generates Reports for frequency and pattern references
 Report naming conventions:
 jobp.FRQ – Results sorted by frequency, descending order
 jobp.SRT – Results sorted by field mask, ascending order
 job.PAT – Pattern reference file
Character Discrete - Character Results

Field FRQ Sample

Name      Count     FRQ %     Field Mask     “Example”
DOB       00000908  45.309%                  [X]|
DOB       00000005  0.250%    00000000       [X]| 00000000
DOB       00000004  0.200%    19440225       [X]| 19440225
DOB       00000004  0.200%    19440609       [X]| 19440609
DOB       00000004  0.200%    19460212       [X]| 19460212
POLNUMB   00000001  0.050%    014669402      [X]| 014669402
POLNUMB   00000208  11.00%    617-338-0300   [X]| 617-338-0300
POLNUMB   00000001  0.050%    AM07B002470    [X]| AM07B002470
POLNUMB   00000001  0.050%    AM07B002736    [X]| AM07B002736

[X] indicates a new set of example records
Character Discrete: Field Mask (T)ype

 Usage: Data formats (patterns):
 View the format of fields that you suspect may follow or conform to a
specific format, e.g., dates, PIN, Tax ID, account numbers
 Generates reports for frequency and pattern references
 Report naming conventions:
 jobp.FRQ – Results sorted by frequency, descending order
 jobp.SRT – Results sorted by field mask, ascending order
 job.PAT – Pattern reference file
Exercise 4-1:
Character Discrete Investigate
1. Create Investigate job
2. Identify the type of investigation
3. Select input file
4. Choose field(s) and mask options
5. Stage and run job
6. Review report results
Lab 4-1:
Character Discrete Investigate – Type T
1. Add Investigate job
2. Identify the type of investigation
3. Select input file
4. Choose field(s) and mask options
5. Stage and run job
6. Review report results
Character Concatenate

 Usage: Identify Field Relationships
 Investigate one or more fields to uncover any relationship between the
field values.
 QualityStage Toolkit
 Uses combinations of character masks
 Generates Reports for frequency and pattern references
 Report naming conventions:
 jobp.FRQ – Results sorted by frequency, descending order
 jobp.SRT – Results sorted by field mask, ascending order
 job.PAT – Pattern reference file
Character Concatenate Results

DOB and DOD Fields

Count     FRQ %     Field Mask          Sample / “Example”
00000908  45.309%   bbbbbbbbbbbbbbbb    [X] |
00000020  2.009%    bbbbnnnnbbbbbbbb    [X] | 1904
00001096  54.691%   nnnnnnnnbbbbbbbb    [X] | 06011944

[X] indicates a new set of example records
Exercise 4-2: Character Concatenate

1. Add Investigate job
2. Identify the type of investigation
3. Select input file
4. Choose field(s) and mask options
5. Stage and run job
6. Review report results
Word Investigate

 Usage: Pattern free-form fields and lexical analysis
 To view the pattern of the data within a freeform text field and parse
it into individual tokens
 QualityStage
 Apply rules sets to free-form fields
 Discover parsing requirements
 Pattern data
 Generates reports for word frequency, pattern frequency
distributions, and word classification
Word Investigation Results

Pattern Report
^D?T  639 N MILLS AVE
^D?S  306 W MAIN ST
^D?T  3142 W CENTRAL AVE
^?T   843 HEARD AVE

Word Frequency Report
0000000869 ST
0000000791 RD
0000000622 STE
0000000566 AVE

Word Classification Report
ABBOTT    ABBOTT    ?  ;0000000001
ABERCON   ABERCON   ?  ;0000000001
ABERCORN  ABERCORN  ?  ;0000000007
ABERDEEN  ABERDEEN  ?  ;0000000001
Rule Sets

 Rules for parsing, classifying, and organizing data
 Rule Set Domains
 Country processing
 Pre-processing
 Domain Processing
 Name: Business and Personal
 Street Address
 Area: Locality, City, State and Zip/Postal codes

 Multinational Address Processing
Parsing

 Parse free-form data with the SEPLIST and a STRIPLIST
 SEPLIST - Any character in the SEPLIST will
separate tokens, and become a token itself
 STRIPLIST - Any character in the STRIPLIST
will be ignored in the resulting pattern
 The SEPLIST is always applied first
Parsing Example

Example: 120 Main St. N.W.
SEPLIST “¬.”
STRIPLIST “¬“
Token1 Token2 Token3 Token4 Token5 Token6 Token7 Token8
120 Main St . N . W .

SEPLIST “¬”
STRIPLIST “¬.“
Token1 Token2 Token3 Token4
120 Main St NW
SEPLIST “¬.”
STRIPLIST “¬.“
Token1 Token2 Token3 Token4 Token5
120 Main St N W
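The three cases above can be reproduced with a small sketch, using a space wherever the slides show ¬ (an illustration of the SEPLIST/STRIPLIST idea, not the QualityStage parser):

```python
def parse(text: str, seplist: str, striplist: str) -> list:
    # SEPLIST is applied first: each SEPLIST character ends the current
    # token and becomes a token of its own.
    tokens, current = [], ""
    for ch in text:
        if ch in seplist:
            if current:
                tokens.append(current)
                current = ""
            tokens.append(ch)
        else:
            current += ch
    if current:
        tokens.append(current)
    # STRIPLIST characters are then removed; empty tokens disappear.
    stripped = ["".join(c for c in t if c not in striplist) for t in tokens]
    return [t for t in stripped if t]

# parse("120 Main St. N.W.", seplist=" ", striplist=" .")
#   -> ["120", "Main", "St", "NW"]
```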
Data Typing: Classifying Tokens

 Identify and type the token in terms of its business meaning and value
MASK KEY:
N – Numeric token
A – Alpha token
M – Mixed token

120 Main Street Apt 6C
N   A    A      A   M

PATTERN KEY:
^ – Numeric token
? – Unclassified alpha token
@, <, > – Mixed token
T – Street Type
U – Unit Type

120 Main Street Apt 6C
^   ?    T      U   >
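Classifying tokens against a classification table can be sketched as below. The CLASSES table is an assumed sample (real rule sets ship much larger ones), and this is an illustration rather than the actual Word Investigate logic:

```python
# Illustrative classification table; not a real QualityStage rule set.
CLASSES = {"ST": "T", "STREET": "T", "AVE": "T",
           "APT": "U", "APARTMENT": "U", "STE": "U"}

def pattern(tokens: list) -> str:
    # Map each token to its pattern class: ^ numeric, ? unclassified
    # alpha, T/U classified words, < or > for mixed tokens.
    out = []
    for tok in tokens:
        if tok.isdigit():
            out.append("^")
        elif tok in CLASSES:
            out.append(CLASSES[tok])
        elif tok.isalpha():
            out.append("?")
        else:
            out.append(">" if tok[0].isdigit() else "<")
    return "".join(out)

# pattern(["120", "MAIN", "STREET", "APT", "6C"]) -> "^?TU>"
```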
Example: Word Investigate

Word Classification Report
ABBOTT    ABBOTT    ?  ;0000000001
ABERCON   ABERCON   ?  ;0000000001
ABERCORN  ABERCORN  ?  ;0000000007
ABERDEEN  ABERDEEN  ?  ;0000000001

Pattern Report              Word Frequency Report
^D?T  639 N MILLS AVE       0000000869 ST
^D?S  306 W MAIN ST         0000000791 RD
^D?T  3142 W CENTRAL AVE    0000000622 STE
^?T   843 HEARD AVE         0000000566 AVE

 Produce reports based on patterns & tokens
 Classify known words and assign default tags

Parse:  21 WINGATE STREET APARTMENT 601
        ^  ?       T      U         ^
Lab 4-3: Word Investigation
Address and Area
1. Add Investigate job
2. Identify the type of investigation
3. Select input file
4. Choose rule set and field(s)
5. Choose Advanced Options
6. Stage and run job
7. Review report results
Data Quality Assessment
 Review and analyze each field for the following information:
 How often is the field populated?
 What are the anomalies and out-of-range values? How often
does each one occur?
 How many unique values were found?
 What is the distribution of the data or patterns?
 Use Investigate results to:
 Update business requirements
 Define development plan and application design
Quiz
 What is domain integrity?
 What is the difference between a Type C and a Type T
field mask?
 When might you use a Type X field mask?
 Where can you find the Investigate reports?
Module Summary
 DRE Methodology: Data Quality Assessment
 Character discrete, concatenate and word investigation
 Field Masks
 Character (C)
 Type (T)
 Ignore (X)
 Parsing – SEPLIST, STRIPLIST
 Data Classification
 Patterns
Data Preparation
Data Preparation
 Format of data file
 Unique record identifier
 Common record layout
High-Level DFD
AUTOHOME  LIFE

 Pre-Process US Data: Add Unique Key; Append Data to a common format
 Investigate: Assess Data Quality
 Standardize Country
 Select US Data for further processing; Reject Non-US Data
 Condition Name, Address and Area
 Apply User Overrides
 Investigate Conditioned Results
 Identify Duplicate Customer Records
 Survive the Best Customer Record
Data File Format
 Preferred data file format for QualityStage is:
 Fixed record length
 Fixed-fielded data
 Sequential file with terminated records
 Alphanumeric data
 QualityStage provides the following features for working with
other file formats:
 ODBC enabled for pulling/pushing data from/to a table
 Unterminated and Variable length
 Fixed-length unterminated
 The Transfer (GTF) stage is used to read in the various formats
and output a fixed-record length terminated file
Unique Record Key
 Every record should start the QualityStage process with a unique
record key
 This key can be created in QualityStage or by other tools like
DataStage
 The QualityStage Investigate Stage will help validate if a unique
key exists
 This unique key provides developers with a way to audit each
record as it passes through the QualityStage application
 The Transfer Stage can be used to create a new key field and
populate the new field with a unique value
Common Data Format
 Fields identified for processing should be moved forward from each source and appended into a single new source file
   Allows for efficiently processing all data in one stream using one set of rules
 In QualityStage, appending data files is accomplished
with the Transfer (GTF) stage
Transfer Stage (GTF)
 Transforms data file formats to fixed length flat files
 Adds new fields
 Assign literal values such as a source indicator
 Generate and assign a sequential value
 Reformat record layouts
 Drop fields
 Format field data
 Case formatting
 Right/left justification
 Right/left fill
 Concatenate fields
 Append Data files
Add a Record Key
Input Data File
1,25  26,25  52,20  73,2  76,5

639 N MILLS AVE           ORLANDO       FL 32803
4275 OWENS RD, STE 536    EVANS         GA 30809
ERIN OFFICE PARK          DUBLIN        GA 31021
600 E OGLETHORPE HWY      HINESVILLE    GA 31313

        Transfer Stage (GTF): populate Record Key field

Output Data File
1,12  13,25  38,25  64,20  85,2  88,5

FW000000001 639 N MILLS AVE          ORLANDO       FL 32803
FW000000002 4275 OWENS RD, STE 536   EVANS         GA 30809
FW000000003 ERIN OFFICE PARK         DUBLIN        GA 31021
FW000000004 600 E OGLETHORPE HWY     HINESVILLE    GA 31313
Record Key Best Practices

 Add a unique record identifier in the QualityStage process or prior to entering QualityStage processing
 Create a 12-byte field
   The first 2 bytes indicate the source
   Positions 3 through 11 store a sequential number
   Position 12 is intentionally left blank (for this training), providing a space between the record key and the data
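The 12-byte key layout above can be sketched as follows (an illustration of the convention, not a QualityStage feature):

```python
def record_key(source, seq):
    """12-byte record key: bytes 1-2 source code, bytes 3-11 a
    zero-padded sequence number, byte 12 a blank separator."""
    assert len(source) == 2 and 0 < seq <= 999_999_999
    return f"{source}{seq:09d} "

print(repr(record_key("FW", 1)))  # 'FW000000001 '
```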
Append Data Files
 The Transfer stage can read one input file and produce one output file
 To append data, you will need to define a Transfer stage for each file you want to append (Append options selected)
 Be careful of the order – the first transfer does not generally append; only subsequent transfer stages referencing the same output file should append data

AUTOHOME  Transfer Stage 1  COMBINED
LIFE      Transfer Stage 2  COMBINED  (Append options selected)
Exercise 5-1:
Add a Record Key and Append Data Files
1. Read in each source of data
2. Define a new output file with a common format
3. Create Transfer Stage 1
4. Create new Record Key field
5. Populate the Record Key field
6. Add AUTOHOME data to new COMBINED output file

Stage name: AHKEY
Stage type: Transfer
Job Name:   Append

AUTOHOME  AHKEY  COMBINED
Lab 5-1:
Append LIFE to COMBINED Output
1. Create Transfer stage
2. Define new record key field
3. Populate the record key field
4. Append LIFE to AUTOHOME in the COMBINED output file

Stage name: LFKEY
Stage type: Transfer
Job Name:   Append

LIFE  LFKEY  COMBINED
Module Summary
 QualityStage requires files to be fixed record length with terminated records
 The Transfer stage can be used to:
   Convert file formats to fixed record length
   Add new fields and populate them with literal values or sequential numbers
   Append data files
   Format fields
   Reformat record layouts
Standardize
Module Objectives
 Describe the Standardize Stage in the Data Re-engineering Methodology
 Identify Rule Sets
 Apply the Standardize Stage
 Interpret Standardize results
 Investigate unhandled data and patterns
Project Lifecycle: Development
 Review Data Flow Diagram
 Construct Application (Development):
   Standardize Data
   Find Duplicate Candidates (Match)
   Survive Best of Breed (Survive)
 Unit Test
 Review & Refine
Data Re-engineering Methods
Data Quality Assessment (DQA): Analyze free-form fields

Data Re-Engineering (DRE):
I. Investigation   II. Standardize   III. Match   IV. Survive
High-Level DFD
AUTOHOME  LIFE

 Pre-Process US Data: Add Unique Key; Append Data to a common format
 Investigate: Assess Data Quality
 Standardize Country
 Select US Data for further processing; Reject Non-US Data
 Condition Name, Address and Area
 Apply User Overrides
 Investigate Conditioned Results
 Identify Duplicate Customer Records
 Survive the Best Customer Record
Standardize

 Transformation
 Parsing free form fields
 Comparison threshold for classifying like words
 Bucketing data tokens
 Standardization
 Applying standard values and standard formats
 Phonetic Coding for use in Matching
 NYSIIS
 Soundex
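Soundex, one of the phonetic codes listed above, can be sketched with the classic algorithm (QualityStage's variants, including reverse Soundex and NYSIIS, differ in detail):

```python
def soundex(name):
    """Classic Soundex: keep the first letter, encode the remaining
    consonants as digits, skip vowels (H and W do not reset the previous
    code), collapse repeats, and pad to four characters."""
    codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4",
             **dict.fromkeys("MN", "5"), "R": "6"}
    name = "".join(c for c in name.upper() if c.isalpha())
    if not name:
        return ""
    out, prev = name[0], codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            out += code
        if ch not in "HW":          # H and W do not reset the previous code
            prev = code
    return (out + "000")[:4]

print(soundex("MAIN"), soundex("MAINE"))  # M500 M500 - spelling variants collide
```

This collision is the point: MAIN and MAINE receive the same code, so records that differ only by such spelling variations can still be brought together for matching.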
Standardize Example
Input File:

Address Line 1               Address Line 2
1721 W ELFINDALE ST          UNIT 20
1721 W ELFINDALE ST          #20
16200 VENTURA BOULEVARD      SUITE 201
C/O JOSEPH C REIFF           12 WESTERN AVE
1705 W St                    PHILADELPHIA
1655 PONCE DE LEON AVENUE    15TH FLOOR

Result File:

House #  Dir  Str. Name      Type  Unit Type  Unit Value  Floor Type  Floor Value  NYSIIS  Soundex  City
1721     W    ELFINDALE      AVE   UNIT       20
1721     W    ELFINDALE      AVE              20
16200         VENTURA        BLVD
12            WESTERN        AVE
1705          W              ST                                                                     PHILADELPHIA
1655          PONCE DE LEON  AVE                           FLOOR       15
Standardize Process
Parse:                     21 WINGATE STREET APARTMENT 601
Classify &
assign default tags:        ^    ?       T         U     ^
Process patterns
and bucket data

Output File:
House Number  Street Name  Street Type  Unit Type  Unit
21            WINGATE      ST           APT        601

Key:  ^ = Single numeric          ? = One or more unknown alphas
      T = Street type             U = Unit type
Standardize Stage
 Uses Rule Sets for:
   Country processing
   Pre-domain processing – USPREP
   Domain processing – USADDR, USAREA, USNAME
   Multinational Address – WAVES
Types of Rule Sets
Country Identifier:      COUNTRY
Domain Pre-processor:    USPREP
Domain Specific:         USNAME, USADDR, USAREA
Example: Country Identifier
Input Record

100 SUMMER STREET 15TH FLOOR BOSTON, MA 02111
SITE 6 COMP 10 RR 8 STN MAIN MILLARVILLE AB T0L 1K0
28 GROSVENOR STREET LONDON W1X 9FE
123 MAIN STREET

Output Record

US  Y  100 SUMMER STREET 15TH FLOOR BOSTON, MA 02111
CA  Y  SITE 6 COMP 10 RR 8 STN MAIN MILLARVILLE AB T0L 1K0
GB  Y  28 GROSVENOR STREET LONDON W1X 9FE
US  N  123 MAIN STREET
Example: Domain Pre-Processor
Input Record

Field 1:  JIM HARRIS (781) 322-2426
Field 2:  92 DEVIR STREET MALDEN MA 02148

Output Record

Name Domain:     JIM HARRIS
Address Domain:  92 DEVIR STREET
Area Domain:     MALDEN MA 02148
Other Domain:    (781) 322-2426
Example: Domain-Specific
Input Record

100 SUMMER STREET 15TH FLOOR

Output Record

House Number:                    100
Street Name:                     SUMMER
Street Suffix Type:              ST
Floor Type:                      FL
Floor Value:                     15
Address Type:                    S
NYSIIS of Street Name:           SANAR
Reverse Soundex of Street Name:  R520
Input Pattern:                   ^+T>U
Rule Sets
 Rule sets contain logic for:
 Parsing
 Classifying
 Processing data by pattern and bucketing data
 Three required files
 Classification Table
 Dictionary File
 Pattern Action File
 Optional Lookup Tables
Rule Sets Files
Classification Table (.CLS)   Contains standard abbreviations that identify and classify key words

Pattern Action File (.PAT)    Contains a series of patterns and programming commands to condition the data

Dictionary File (.DCT)        Defines the output file fields that store the parsed and conditioned data

Rule Set Description (.PRC)   Description file for the Rule Set

Lookup Tables (.TBL)          Optional conversion and lookup tables for converting and returning standardized values

Override Tables (.TBL)        Tables for storing overrides entered through the Designer GUI
Classification Table
 Contains the words for classification, standardized versions of words, and data class
 Data class (data tag) is assigned to each data token
 Default classes are the same across all rule sets
 User-defined classes are assigned in the classification
table
 Users may modify, add or delete these classes
 User-defined classes are a single letter
Default Classes
Class  Description
^      A single numeric
+      A single unclassified alpha (word)
?      One or more consecutive unclassified alphas
@      Complex mixed token, e.g., ½, O’Connell
>      Leading numeric, e.g., 6A
<      Trailing numeric, e.g., A6
0      Null class
User-defined Classes

Class Description
USNAME
G Generational, e.g., Senior, I, II
P Prefix, e.g. Dr., Mr., Miss
USADDR
T Street Type
D Directional
B Box Type
USAREA
S State Abbreviation
Comparison Threshold
 May be used in the Classification table
 Used to efficiently make entries into the classification table
 Helps overcome spelling and data entry errors
 Not required
 Threshold uses a logical string comparator

Threshold level
900  Exact match
850  Almost certainly the same
800  Most likely equivalent
750  Most likely not the same
700  Almost certainly not the same
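The idea behind the threshold scale can be illustrated with a string comparator sketch (this uses Python's difflib scaled onto the 0-900 range; QualityStage's own comparator is different, so treat the numbers as illustrative):

```python
from difflib import SequenceMatcher

def similarity_900(a, b):
    """Scale a string-similarity ratio onto the 0-900 threshold scale
    used by the classification table (illustrative only)."""
    return round(SequenceMatcher(None, a.upper(), b.upper()).ratio() * 900)

def classify_fuzzy(token, table, threshold=800):
    """Accept the best classification-table entry whose similarity to the
    token is at or above the threshold; otherwise leave it unclassified."""
    best = max(table, key=lambda word: similarity_900(token, word))
    if similarity_900(token, best) >= threshold:
        return table[best]
    return None

table = {"BOULEVARD": ("BLVD", "T"), "STREET": ("ST", "T")}
print(classify_fuzzy("BOULEVRD", table))  # ('BLVD', 'T') despite the typo
print(classify_fuzzy("XYZ", table))       # None
```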
Classification Table Example
;-------------------------------------------------------------------------------
; USADDR Classification Table
;-------------------------------------------------------------------------------
; Classification Legend
;-------------------------------------------------------------------------------
; B - Box Types
; D - Directionals
; F - Floor Types
; H - Highway Modifiers
; R - Rural Route, Highway Contract, Star Route
; T - Street Types
; U - Unit Types
;-------------------------------------------------------------------------------
; Table Sort Order: 51-51 Ascending, 26-50 Ascending, 1-25 Ascending
;-------------------------------------------------------------------------------
DRAW "PO BOX" B
DRAWER "PO BOX" B
PO "PO BOX" B
POB "PO BOX" B
POBOX "PO BOX" B
POBX "PO BOX" B
PODRAWER "PO BOX" B
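A simplified reader for lines in this format can be sketched as (assumptions: token, quoted standard value, and class are whitespace-separated; `;` starts a comment):

```python
import shlex

def load_classification(lines):
    """Read .CLS-style lines: TOKEN "STANDARD VALUE" CLASS.
    Lines beginning with ';' are comments. A simplified reader,
    not a full parser for the format."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";"):
            continue
        token, standard, cls = shlex.split(line)[:3]
        table[token] = (standard, cls)
    return table

cls_table = load_classification([
    "; B - Box Types",
    'DRAWER "PO BOX" B',
    'POB    "PO BOX" B',
])
print(cls_table["POB"])  # ('PO BOX', 'B')
```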
Dictionary File
 Defines the field definitions for the output file
 When data is moved to these output fields it is called “bucketing” the data
 The order that the fields are listed in the dictionary file
defines the order the fields are written to the output file
 Dictionary file entries are similar to field definitions
Dictionary File Example
;;QualityStage v7.0
\FORMAT\ SORT=N
;-------------------------------------------------------------------------------
; USADDR Dictionary File
;-------------------------------------------------------------------------------
; Total Dictionary Length = 415
;-------------------------------------------------------------------------------
; Business Intelligence Fields
;-------------------------------------------------------------------------------
HN C 10 S HouseNumber ;0001-0010
HS C 10 S HouseNumberSuffix ;0011-0020
PD C 3 S StreetPrefixDirectional ;0021-0023
PT C 20 S StreetPrefixType ;0024-0043
SN C 25 S StreetName ;0044-0068
ST C 5 S StreetSuffixType ;0069-0073
SQ C 5 S StreetSuffixQualifier ;0074-0078
SD C 3 S StreetSuffixDirectional ;0079-0081
RT C 3 S RuralRouteType ;0082-0084
RV C 10 S RuralRouteValue ;0085-0094
BT C 7 S BoxType ;0095-0101
BV C 10 S BoxValue ;0102-0111
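Because the dictionary entries define name, length, and order, bucketing amounts to writing a fixed-width record. A sketch (using a subset of the field lengths from the example above):

```python
# Dictionary entries as (field name, length), in file order.
DICT = [("HouseNumber", 10), ("StreetPrefixDirectional", 3),
        ("StreetName", 25), ("StreetSuffixType", 5)]

def bucket(values):
    """Left-justify each value into its fixed-width slot; fields with no
    value stay blank, and overlong values are truncated."""
    return "".join(values.get(name, "").ljust(width)[:width]
                   for name, width in DICT)

rec = bucket({"HouseNumber": "639", "StreetPrefixDirectional": "N",
              "StreetName": "MILLS", "StreetSuffixType": "AVE"})
print(repr(rec))
```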
Pattern Action File
 The Pattern-Action file contains the rules for standardization; that is, the actions to execute for a given pattern of tokens
 Records are processed from the top down
 Written in Pattern Action Language
 Complex parsing can be coded in this file
Pattern Action File Example
Street Address:  10 Hollow Oak Road
Pattern:         ^ ? T

Pattern Action Language:
  COPY   [1] {HN}
  COPY_S [2] {SN}
  COPY_A [3] {ST}

Result:   {HN}  {SN}        {ST}
          10    Hollow Oak  Rd
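What those three COPY actions achieve can be sketched in plain Python (a toy illustration, not Pattern Action Language; the street-type table is a hypothetical stand-in for the classification table):

```python
def standardize_address(tokens, street_types):
    """Classify tokens, grouping consecutive unclassified alphas into one
    '?' operand, then for the pattern ^ ? T copy operand 1 to {HN},
    operand 2 to {SN}, and the standard abbreviation of operand 3 to {ST}."""
    operands, pattern = [], []
    for tok in tokens:
        if tok.upper() in street_types:
            operands.append(tok); pattern.append("T")
        elif tok.isdigit():
            operands.append(tok); pattern.append("^")
        elif pattern and pattern[-1] == "?":
            operands[-1] += " " + tok   # extend the '?' run
        else:
            operands.append(tok); pattern.append("?")
    if pattern == ["^", "?", "T"]:
        return {"HN": operands[0], "SN": operands[1],
                "ST": street_types[operands[2].upper()]}
    return {"UnhandledPattern": " ".join(pattern)}

street_types = {"ROAD": "RD", "STREET": "ST", "AVENUE": "AVE"}
print(standardize_address("10 Hollow Oak Road".split(), street_types))
# {'HN': '10', 'SN': 'Hollow Oak', 'ST': 'RD'}
```

Anything that does not fit the pattern falls through as unhandled, which is exactly what the Unhandled Pattern reporting field captures.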
Optional Lookup Tables
 Called from the Pattern Action File
 Rule sets may contain lookup tables such as:
   Common First Names and Enhanced First Names
     Barb & Barbara
     Ted & Edward
   Gender based on name
   State abbreviations
   Common city abbreviations
     NYC = New York City
     LA = Los Angeles
Standardize Process
Parse:                     21 WINGATE STREET APARTMENT 601
Classify &
assign default tags:        ^    ?       T         U     ^    (Classification Table)
Process patterns
and bucket data                                               (Pattern Action File)

Output File:                                                  (Dictionary File)
House Number  Street Name  Street Type  Unit Type  Unit
21            WINGATE      ST           APT        601
Standardizing International Data

 Two methods:
   Method 1: Use country pre-processor, domain pre-processor, and domain-specific rules
     Uses out-of-the-box, included functionality/rules
   Method 2: Use Multinational Standardize and WAVES
     Requires purchase of WAVES database
Making WAVES
 When working with files containing multinational addresses,
QualityStage provides the following tools:
 Multinational Address Standardization stage, which standardizes address files at city and street levels
   Functionality available out of the box
 WAVES (Worldwide Address Verification and Enhancement System) stage


which standardizes, corrects, verifies and enhances addresses against a
country-specific postal reference file
 Requires purchase of WAVES database
What the WAVES Stage Does
 After the original input files have been standardized, the WAVES stage
performs these main functions
 Corrects – Corrects defects in the input data due to typographical or spelling errors
 Verifies – Using probabilistic matching, the WAVES stage tries to match corrected address records against addresses in a country-specific postal reference file
 Enhances – Assigns the postal record data to the appropriate fields in the output file, thereby substituting verified postal data for any erroneous or missing input data
 Indicates Verification result – Shows whether a record has been successfully verified
by the WAVES stage
 Overall degree of verification also indicated
WAVES Input File Requirements
 Fixed-fielded, fixed record length data files
 Total line length cannot exceed 4096
 Address data must occur within first 3072
 Each record must contain
   A country indicator: full spelling, abbreviation, or 2- or 3-byte ISO country code
     A mismatch of the country indicator to country- and street-level formats results in data not being standardized, and output as unhandled
     – For example, the identifier says a record is German but the address format is that of France
 Unique record identifier (record key)
 Use preprocessor to remove any non-address data from the address fields
 c/o
 Attn:
 Department
 Multinational Standardize (MNS) stage automatically used as part of
WAVES stage processing
WAVES Output
 City-level verification
 Correct, enhance and verify city field
 Correct, enhance and verify neighborhood/locality field
 Correct, enhance and verify state/province field
 Verify postal code field (but not correct it)
 Indicate if record has been verified (and to what degree)
 Street-level verification
 Correct, enhance and verify the street info field
 Correct and verify postal code
 Indicate the match weight, which shows the degree of certainty of the
probabilistic match between the input and reference data
About the Verification Process
 Uses ISO codes, which are applied during standardization, as critical match fields on all city- and street-level verification attempts
 Tries to verify that the city, state/province, and postal code are correct based upon the available information in the record
 For example, if no state/province is available, uses postal code to impute the
missing information
 If the postal code conflicts with the city, WAVES copies the city
and province fields from the postal record, but does not change
the postal code since WAVES cannot verify which is the correct
data
Modifying Standardization Behavior
 MNS rules (used by WAVES) can be modified using the override functionality in QualityStage Designer
High-Level DFD
AUTOHOME  LIFE

 Pre-Process US Data: Add Unique Key; Append Data to a common format
 Investigate: Assess Data Quality
 Standardize Country 
 Select US Data for further processing; Reject Non-US Data
 Condition Name, Address and Area
 Apply User Overrides
 Investigate Conditioned Results
 Identify Duplicate Customer Records
 Survive the Best Customer Record
Country Rule Set
 The Country rule set appends the two-byte ISO country code
 Input to the country rule set includes:
 Street Address
 City or locality
 State
 Zip or Postal code
 Country field (if it exists)
 Output:
 Two byte ISO country code
 Flag identifying explicit or default decision
Exercise 6-1:
Country Standardize
1. Define the output file
2. Create the job
3. Define the job details
    Select COUNTRY rule set
    Identify fields to be conditioned
    Enter metadata label
4. Run the job
5. Review results

Stage name: CNTRYSTAN
Stage type: Standardize
Rule set:   COUNTRY
Job Name:   STAN

COMBINED  CNTRYSTAN  CNTRYOUT
High-Level DFD
AUTOHOME  LIFE

 Pre-Process US Data: Add Unique Key; Append Data to a common format
 Investigate: Assess Data Quality
 Standardize Country 
 Select US Data for further processing; Reject Non-US Data 
 Condition Name, Address and Area
 Apply User Overrides
 Investigate Conditioned Results
 Identify Duplicate Customer Records
 Survive the Best Customer Record
Selecting US Data
 The QualityStage Select Stage provides the capability of selecting and/or rejecting records based on a set of values for a field
 Selecting or splitting data requiring compound or
complex logic may require multiple select stages or a
custom rule set
Select Stage
 The Select Stage accepts one input file and may output multiple files
   Accept – records that meet the criteria
   Reject – records that do not meet the criteria
   Split – both the Accept and Reject files are created
 Input and output files have the same layout
 Select allows you to choose a field and create a list of values
   If a record’s value is equivalent to a value on the list, the record is accepted; otherwise it is rejected
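The accept/reject behavior described above can be sketched as a simple value-list filter (illustrative; field names here are hypothetical):

```python
def select(records, field, accept_values):
    """Select stage sketch: route each record to Accept if its field
    value is on the list, otherwise to Reject (a Split outputs both)."""
    accept = [r for r in records if r[field] in accept_values]
    reject = [r for r in records if r[field] not in accept_values]
    return accept, reject

rows = [{"key": "FW000000001", "iso": "US"},
        {"key": "FW000000002", "iso": "CA"},
        {"key": "FW000000003", "iso": "US"}]
us, non_us = select(rows, "iso", {"US"})
print(len(us), len(non_us))  # 2 1
```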
Exercise 6-2:
Split US From Non-US Data
1. Create output files
    One for Accept records
    One for Reject records
2. Create Select stage
3. Add to STAN job
4. Run Select stage
5. Review results

Stage name: SPLIT
Stage type: Select
Job Name:   STAN

CNTRYOUT  SPLIT  USDATA (Accept), NONUSDATA (Reject)
Domain Pre-Processor Rule Sets
 Pre-processor rule sets are designed to filter name, street address and area (city, state, zip) data
   For example, if the city, state and zip are found in ADDRESS LINE 2, the pre-processor rule set will attempt to recognize this data and move it into the area domain
 The pre-processor rule set prepares the data for
processing by domain specific rule sets
Exercise 6-3:
US Prep Rule Set
1. Define the output file
2. Create the job
3. Define the job details
    Select USPREP rule set
    Identify fields to be conditioned
    Enter metadata labels
4. Run the job
5. Review results

Stage name: PREPDATA
Stage type: Standardize
Rule set:   USPREP
Job Name:   STAN

USDATA  PREPDATA  PREPOUT
Domain Rule Sets

 Domain rule sets expect only data for that domain as the
input
 Domain rule sets that come with QualityStage are:
 Name
 Street address
 Area (city, state and zip)
High-Level DFD
AUTOHOME  LIFE

 Pre-Process US Data: Add Unique Key; Append Data to a common format
 Investigate: Assess Data Quality
 Standardize Country 
 Select US Data for further processing; Reject Non-US Data 
 Condition Name, Address and Area
 Apply User Overrides
 Investigate Conditioned Results
 Identify Duplicate Customer Records
 Survive the Best Customer Record
USNAME Rule Set
 The USNAME rule set works on both personal names and organization names for US data
 Data is parsed into name components
 Phonetic coding of the First Name and Primary Name
are created for matching
USADDR Rule Set
 This rule set is applied to street address fields
 The “Address Type” flag identifies different types of addresses
 ‘S’ Street address
 ‘B’ Box address
 ‘R’ Rural route address
 Phonetic coding of the Street Name is created for
matching
USAREA Rule Set
 This rule set is applied to city, state and postal code fields
 Data is parsed into city name, state abbreviation, zip
code and zip plus four
 Phonetic coding of the city name is created for matching
Exercise 6-4:
Domain Standardize
1. Define the output file
2. Create the job
3. Define the job details
    Select USNAME, USADDR, USAREA rule sets
    Identify fields to be conditioned
4. Run the job
5. Review results

Stage name: STANALL
Stage type: Standardize
Rule sets:  USNAME, USADDR, USAREA
Job Name:   STAN

PREPOUT  STANALL  STANOUT
Standardize Results
 Business Intelligence fields
 Parsed from the original data, they may be used in matching
and generally they are moved to the target system
 Matching Fields
 Generally these fields are created to help during the match
process and are dropped after successful matching
 Reporting fields
   Specifically created to help review results of Standardize and recognize handled and unhandled data
Business Intelligence Fields
 Intelligent data parsed and bucketed from the input free-form field

USNAME Examples     USADDR Examples     USAREA Examples
 Title               HouseNumber        City
 First Name          Directional        State
 Middle Name         Street Name        Zip5
 Primary Name        Unit Types         Zip4
 Generational        Box Types
                     Unit Values
                     Building Names
Standardize Matching Fields

 Phonetic coding
 NYSIIS
 Reverse NYSIIS
 Soundex
 Reverse Soundex
 Hash keys
 First 2 characters of the first five words
 Packed Keys
 Data concatenated, or packed
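The hash key construction can be sketched as follows (a guess at the exact rule based on the bullet above; QualityStage may handle short words and short records differently):

```python
def hash_key(text, words=5, chars=2):
    """Match hash key: the first 2 characters of each of the first 5 words."""
    return "".join(w[:chars] for w in text.split()[:words])

print(hash_key("21 WINGATE STREET APARTMENT 601"))  # 21WISTAP60
```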
Standardize Reporting Fields
Unhandled Pattern   The pattern generated for tokens not processed by the selected rule set
Unhandled Data      The remaining tokens not processed by the selected rule set
Input Pattern       The pattern generated for the stream of input tokens, based on the parsing rules and token classifications
Exception Data      The tokens not processed by the rule set because they represent a data exception
Best Practice:
Investigate Handled Data
 Review the business intelligence fields to ensure accurate
bucketing of the data
 Build a Character Discrete Investigation for each field
and review the contents and the format
 Build Character Concatenate Investigation to review:
 Unhandled Patterns
 Unhandled Data
 Input Pattern
 Input Fields
High-Level DFD
AUTOHOME  LIFE

 Pre-Process US Data: Add Unique Key; Append Data to a common format
 Investigate: Assess Data Quality
 Standardize Country 
 Select US Data for further processing; Reject Non-US Data 
 Condition Name, Address and Area 
 Apply User Overrides
 Investigate Conditioned Results
 Identify Duplicate Customer Records
 Survive the Best Customer Record
Exercise 6-5:
Investigate NAME Unhandled Patterns and Data
Identify the unhandled patterns for the NAME field. In the report include the unhandled data, input pattern, original data and the record key.

1. Build a Character Concatenate Investigation using the following fields
2. Increase the number of samples to 5

Field Name  Field Description         Type
UPUSNAM     Unhandled Pattern         C
UDUSNAM     Unhandled Data            X
IPUSNAM     Input Pattern             X
NAME        Original Name field data  X
Exercise 6-6:
Investigate Address/Area Unhandled Patterns
Identify the unhandled patterns for the Address and Area fields. In the report include the unhandled data, input pattern, original data and the record key.

1. Build a Character Concatenate Investigation using the following fields
2. Increase the number of samples to 5

Field Name  Field Description  Type
UPUSADD     Unhandled Pattern  C
UDUSADD     Unhandled Data     X
IPUSADD     Input Pattern      X
ADDR1       Address Line 1     X
ADDR2       Address Line 2     X
Module Summary
 QualityStage comes with pre-defined rule sets that are highly flexible and customizable
 Support multi-national address processing
 Country rule sets
 Pre-processing rule sets for mixed-domain challenges
 Domain rule sets
 Custom rule sets
Rule Set Overrides
Module Objectives
 Identify the location of the User Override functionality
 Describe the different types of User Overrides
 Apply User Overrides
 Test User Overrides
 Locate the User Override tables
Data Re-engineering Methods
Data Quality Assessment (DQA): Analyze free-form fields

Data Re-Engineering (DRE):
I. Investigation   II. Standardize   III. Match   IV. Survive
High-Level DFD
AUTOHOME  LIFE

 Pre-Process US Data: Add Unique Key; Append Data to a common format
 Investigate: Assess Data Quality
 Standardize Country 
 Select US Data for further processing; Reject Non-US Data 
 Condition Name, Address and Area 
 Apply User Overrides 
 Investigate Conditioned Results
 Identify Duplicate Customer Records
 Survive the Best Customer Record
Customizing Rule Sets
 A rule set may require modification if some input data is:
   Not processed
   Incorrectly processed
 QualityStage provides:
   User Overrides
   Rules Analyzer
User Overrides
 Provides the user the ability to modify the rule sets
 The following types of rule sets can be modified using User Overrides
   Domain Pre-processor rule sets
   Domain rule sets
   Validation rule sets
   Multinational Address Standardize rule sets
 There are five types of user overrides, relating to classifications, patterns, and text strings
 User overrides are
 GUI Driven
 Stored in separate lookup tables
User Classification Override
 Recognized as a keyword and classified
   Additional words
   New abbreviations, variations
   Misspellings of a word
 User Classifications may override or add:
   Original values (Token values)
   Standard value
   Class
Example: Classification Override
Unhandled Data  Input Pattern  Original Data
                +,+            HOCHREITER , CAROLYNNE

Add CAROLYNNE as a valid first name to the classification table:

Override:  Token Value  Standard Value  Class
           Carolynne    Carolynne       F

Corrected Pattern:  Input Pattern  Original Data
                    +,F            HOCHREITER , CAROLYNNE
Text Overrides
 Allow the user to specify overrides based on an entire text string
 Use this override for special cases and specific handling
of a string of text
 Input Text Overrides
 Applied to the original text string
 Unhandled Text Overrides
 Applied to the Unhandled Data field
Example: Input Text Overrides
Input Text:   Unhandled Pattern  Input Text
              ++                 ZACHARIA GELLMAN
              ++                 TOMMOTHY CABBOTT
              ++                 REIFF FUNERAL

Override:     Input Text     Override
              REIFF FUNERAL  Move text string to the Primary Name field

Results:      Input Pattern  Primary Name
              ++             REIFF FUNERAL
Pattern Overrides
 Allow the user to specify overrides based on an entire pattern
 Use this override when most or all records should be
processed with identical logic
 Input Pattern Overrides
 Applied to the original text string
 Unhandled Pattern Overrides
 Applied to the Unhandled Data field
Unhandled Pattern Overrides
Unhandled Pattern:  Unhandled Pattern  Input Text
                    +, +               HAYWARD, WINSLOW
                    +, +               ESHAGHIAN , JOUBI
                    +, +               BOULDER, CORONA

Override:  +, +  – Move the first + to Primary Name (the comma provides context),
                   move the second + to First Name

Results:   Unhandled Pattern  First Name  Primary Name
           +, +               WINSLOW     HAYWARD
           +, +               JOUBI       ESHAGHIAN
           +, +               CORONA      BOULDER
User Override Precedence
User Classification  Recognize words to classify
Input Text           Modify logic based on the input string
Input Pattern        Modify logic based on the input pattern
Unhandled Text       Modify logic based on the unhandled data string
Unhandled Pattern    Modify logic based on the unhandled pattern
Rule Set Precedence
1. CLASSIFICATION TABLE
2. USER CLASSIFICATION
3. INPUT TEXT
4. INPUT PATTERN
5. PATTERN ACTION FILE
6. UNHANDLED TEXT
7. UNHANDLED PATTERN
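The precedence idea can be sketched as an ordered chain of handlers, where the first layer that produces a result wins (illustrative only; the two layers below are hypothetical stand-ins for an input-text override and the pattern action file):

```python
def standardize(text, layers):
    """Apply override layers in precedence order; the first layer that
    returns a result wins, mirroring the precedence list above."""
    for name, layer in layers:
        result = layer(text)
        if result is not None:
            return name, result
    return "unhandled", None

layers = [
    ("input text", lambda t: {"PrimaryName": t} if t == "REIFF FUNERAL" else None),
    ("pattern action", lambda t: {"FirstName": t.split()[0],
                                  "PrimaryName": t.split()[1]}
                       if len(t.split()) == 2 else None),
]
print(standardize("REIFF FUNERAL", layers)[0])     # input text - the override wins
print(standardize("ZACHARIA GELLMAN", layers)[0])  # pattern action
```

Note that "REIFF FUNERAL" also fits the generic two-word rule, but the higher-precedence input-text override handles it first, which is exactly the behavior the precedence order guarantees.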
Rule Set Override Process

1. Enter override
2. Apply override
3. Test override with the Rules Analyzer
4. Repeat steps 1 through 3 for all desired overrides
Exercise 7-1:
Name Rule Set User Override
 Review the unhandled NAME patterns in the
INUPNAMp.frq report
 Apply NAME overrides
 Test NAME overrides
 Re-run the STAN Job to re-produce the new output file
with the overrides applied
Exercise 7-2:
Address and Area Overrides
 Review the Investigation reports of unhandled Address
and Area data
 Apply Users Overrides to unhandled data
 Test the Override
 Re-run the STAN Job to re-produce the new output file
with the overrides applied
Module Summary
 There are five types of user overrides
 User overrides can be applied to:
 The classification table
 Input text
 Input patterns
 Unhandled text
 Unhandled patterns
 Overrides are applied in a specific order
 The Standardize Rules Analyzer can be used to test and
review user overrides
Match
Module Objectives
 Describe where Match fits in the Data Re-engineering Methodology
 Describe QualityStage Match concepts
 Define the type of matching algorithms
 Describe the importance of blocking
 Apply multiple match passes to increase
efficiency/efficacy
 Interpret and improve match results
Data Re-engineering Methods

Data Quality Assessment (DQA): analyze free-form fields
Data Re-Engineering (DRE):
I. Investigation   II. Standardize   III. Match   IV. Survive
High-Level DFD

[Data flow diagram: AUTO, HOME and LIFE source files flow through the steps below]

 Investigate / Assess Data Quality
 Standardize Country
 Select US Data for further processing / Reject NON US Data
 Pre-Process US Data
 Condition Name, Address and Area
 Apply User Overrides
 Investigate Conditioned Results
 Append Data to a common format
 Add Unique Key
 Identify Duplicate Customer Records
 Survive the Best Customer Record
Match Stage
 Statistically-based method for determining matches
 Over 24 match comparison algorithms providing a
full spectrum of fuzzy matching functions
 Ability to measure informational content of data
 Identify duplicate entities within one or more files
 Array matching
 Match wizards and templates
 Critical field settings
What Constitutes a Good Match?

Which of the following record pairs is a match? And how do you know?

W HOLDEN 12 MAIN ST
W HOLDEN 12 MAINE ST

W HOLDEN 128 MAIN PL 02111 12/8/62
W HOLDEN 128 MAINE PL 02110 12/8/62

WM HOLDEN 128A MAIN SQ 02111 12/8/62 338-0824
WILL HOLDEN 128A MAINE SQ 02110 12/8/62 338-0824
 Do you compare all the shared or common fields?
 Do you give partial credit?
 Are some fields (or some values) more important to you than others?
Why?
 Do more fields increase your confidence?
 By how much? What is enough?
The Value of Information Content
 Information content measures the significance of one
field over another (Discriminating Value)
 A Gender Code contributes less information than a Tax-Id
Number
 Information content also measures the significance of
one value in a field over another (Frequency)
 In a First-Name Field, JOHN contributes less information than
DWEZEL
 Significance is determined by a value’s reliability and its
ability to discriminate, both can be calculated from your
data
Distribution of Weights

WILLIAM J HOLDEN 128 MAIN ST 02111 12/8/62
WILLAIM JOHN HOLDEN 128 MAINE AVE 02110 12/8/62
+1 +1 +17 +2 +4 -1 +7 +9 = 40

The weighted score is a relative measure of the probability of a match

[Histogram: # of Pairs (0-4000) vs. Weight of Comparisons (-20 to 40), showing a
Non-Matches peak at low weights, a Matches peak at high weights, and a grey area
between them: less confidence on the left, more confidence on the right]

Thresholds defined can be used for automated processing
Weights

 Measures the information content of a data value


 Each field contributes to the confidence (probability) of
a match
Types of Weights
 If a field matches, the agreement weight is used
 Agreement weight is a positive value
 If a field doesn’t match, the disagreement weight is used
 Disagreement is a negative value
 Partial weight is assigned for non-exact or “fuzzy”
matches
 Missing values have a default weight of zero
 Weights for all field comparisons are summed to form a
composite weight
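The summation described above can be sketched as follows. This is an illustrative example only, not QualityStage internals; the field names and weight values are hypothetical.

```python
def field_weight(a, b, agree_w, disagree_w):
    """Weight contribution of one field comparison."""
    if a is None or b is None or a == "" or b == "":
        return 0.0                # missing values contribute zero
    if a == b:
        return agree_w            # exact match: agreement weight (positive)
    return disagree_w             # mismatch: disagreement weight (negative)

def composite_weight(rec_a, rec_b, weights):
    """Sum the per-field weights for a record pair."""
    return sum(
        field_weight(rec_a.get(f), rec_b.get(f), aw, dw)
        for f, (aw, dw) in weights.items()
    )

# Hypothetical per-field (agreement, disagreement) weights
weights = {"last_name": (6.5, -3.3), "zip": (4.0, -2.0), "dob": (7.0, -3.0)}
a = {"last_name": "HOLDEN", "zip": "02111", "dob": "12/8/62"}
b = {"last_name": "HOLDEN", "zip": None,    "dob": "12/8/62"}
print(composite_weight(a, b, weights))  # 6.5 + 0 + 7.0 = 13.5
```

A fuller sketch would also assign partial weights for fuzzy agreement; here only the exact/mismatch/missing cases are shown.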
Matching Terminology

Informational Content   Measures the significance of one field value over another

Weight                  Measures the informational content of a data value

Composite Weight        Measures the confidence of a match

Match Cutoffs           Distinguish matches from non-matches

False Positives         Records with a score above the high cutoff that really
                        aren’t a match

False Negatives         Records below the low cutoff that really are a match
Measuring the
Conditions of Uncertainty
 Reliability of the data in a given field
 Estimated as the probability that the field agrees given the
record pair is a match
 Probability of a random agreement of values
 Estimated as the probability the field agrees given the record
pair is not a match
Reliability (M-Probability)

 Approximated as, 1 - error rate for the given field


 The higher the m-probability, the higher the
disagreement weight will be for the field not matching
since the data is considered reliable
Chance Agreement (U-Probability)

 The u-probability can be approximated as the


probability that a field agrees at random (by chance)
 QualityStage uses a frequency analysis to determine the
probability of chance agreement for all values  
 Rare values bring more weight to a match
Calculating Weights

 Agreement weight is estimated as:
   log2(m/u)
 Disagreement weight is estimated as:
   log2((1-m)/(1-u))

M (m-prob) = .9
U (u-prob) = .01

Agreement weight: log2(.9/.01) = 6.49
Disagreement weight: log2((1-.9)/(1-.01)) = -3.31
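The two formulas above can be checked directly; this sketch just restates the slide's arithmetic in code.

```python
import math

# m = reliability: P(field agrees | record pair is a match)
# u = chance agreement: P(field agrees | record pair is not a match)

def agreement_weight(m, u):
    return math.log2(m / u)

def disagreement_weight(m, u):
    return math.log2((1 - m) / (1 - u))

m, u = 0.9, 0.01
print(round(agreement_weight(m, u), 2))     # 6.49
print(round(disagreement_weight(m, u), 2))  # -3.31
```

Note how a rarer chance agreement (smaller u) drives the agreement weight up, which is why rare values bring more weight to a match.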
Blocking

 Grouping together like records that have a high-


probability of producing matches
 Only “like” records are compared to each other making
the match more efficient and computationally feasible
 Records in a “block” match exactly on one to several
blocking fields
Blocking Example: Sample Data

 Block on NYSIIS of Last Name

NYSIIS LNAME   NAME                     ADDRESS              ZIP
YANG           YUNG, WAYNE D            9000 SHEPARD DRIVE   78753
GARAS          GEROSA, FRAN X           29 AARONS CT         06877
YANG           YOUNG, JONATHAN A        1767 TOBEY ROAD      30341
GARAS          GERISA, FRANCIS          29 AARONS CT         06877
GARAS          GEROSA, FRANCIS XAVIER   29 AARONS COURT      06877
MATAC          MARCUS MATIC             100 SUMMER STREET    02111
GARAS          GEROSA, MARY             29 AARONS CT         06877
JANCAN         RENEE JENKINS            100 SUMMER STREET    02111
YANG           YOUNG THERESA C          1767 TOBEY ROAD      30341
Blocking Example
NYSIIS NAME ADDRESS ZIP

YANG YOUNG , WAYNE D 9000 SHEPARD DRIVE 78753


YANG YOUNG , JONATHAN A 4220 BELLE PARK DR 77072
YANG YOUNG THERESA C 1767 TOBEY ROAD 30341

GARAS GEROSA, FRAN X 29 AARONS CT 06877


GARAS GEROSA, FRANCIS XAVIER 29 AARONS COURT 06877
GARAS GEROSA, MARY 29 AARONS CT 06877
GARAS GARISA, FRANCIS 29 AARONS CT 06877

MATAC MARCUS MATIC 100 SUMMER STREET 02111

JANCAN RENEE JENKINS 100 SUMMER STREET 02111

Blocks with only one record are considered residuals


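The grouping step can be sketched as follows. This is an illustration of the blocking idea, not QualityStage internals; it assumes the NYSIIS codes have already been computed.

```python
from collections import defaultdict

# Records as (blocking_key, name); the key is a precomputed NYSIIS code
records = [
    ("YANG", "YOUNG, WAYNE D"),
    ("GARAS", "GEROSA, FRAN X"),
    ("YANG", "YOUNG, JONATHAN A"),
    ("GARAS", "GEROSA, FRANCIS XAVIER"),
    ("MATAC", "MARCUS MATIC"),
]

blocks = defaultdict(list)
for key, name in records:
    blocks[key].append(name)      # records sharing a key land in one block

# Candidate pairs come only from within a block
pairs = [
    (a, b)
    for members in blocks.values()
    for i, a in enumerate(members)
    for b in members[i + 1:]
]
print(len(pairs))   # 2 pairs, versus 10 for an all-pairs comparison

# Single-record blocks are residuals: nothing to compare them against
residuals = [m[0] for m in blocks.values() if len(m) == 1]
print(residuals)    # ['MARCUS MATIC']
```

This is the efficiency gain the slides describe: only "like" records (same block key) are ever scored against each other.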
Balance Scope and Accuracy
Balance the scope and accuracy to compare a reasonable amount of “like” records

Accuracy Scope
The quality of the candidate records The number of records
Blocking Strategy

 Choose fields with reliable data


 Choose fields with a good distribution of values
 Combinations of fields may be used
Examples of Blocking Strategies

 Zip code for matching addresses


 NYSIIS of last name for matching individuals
 Brand name for matching products
 Combination of zip code and NYSIIS of street name
for matching addresses
 Combination of NYSIIS of last name and first letter of
first name for matching individuals
Blocking Summary

 Blocking groups together “like” records


 Matching is more efficient for small block sizes
 Blocks should be between 100 and 200 records
 Blocking fields must match exactly for a candidate set to
be created/evaluated
Match Types

 Unduplication
 Identifies duplicate candidates in one file
 Match (Two File)
 One-to-one correspondence
 For every record on File A we expect to find a match to one
record on File B
 Geomatch (Two File)
 Many-to-one correspondence
 More than one record on File A can match to the same record
on File B
Comparing Data Values
 Different comparisons for different data
 Over 24 comparison methods
 Most common
 CHAR - (character comparison) character by character, left to right.
 UNCERT - (character uncertainty) tolerates phonetic errors,
transpositions, random insertion, deletion, and replacement of characters
 CNT_DIFF – Counts keying errors in numeric fields. You set a tolerance
threshold
 NAME_UNCERT – Can be used to compare any character values; if the
strings are different lengths, the shorter of the two lengths is used
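A comparison in the spirit of CNT_DIFF can be sketched like this. It is an illustration only, not the actual QualityStage algorithm; the tolerance value is hypothetical.

```python
def count_differences(a: str, b: str) -> int:
    """Number of differing character positions, counting any length difference too."""
    diffs = sum(1 for x, y in zip(a, b) if x != y)
    return diffs + abs(len(a) - len(b))

def cnt_diff_agrees(a: str, b: str, tolerance: int = 1) -> bool:
    """Agree if the number of keying errors is within the tolerance threshold."""
    return count_differences(a, b) <= tolerance

print(cnt_diff_agrees("228021975", "228021975"))  # True  (identical)
print(cnt_diff_agrees("228021975", "228021976"))  # True  (one keying error)
print(cnt_diff_agrees("228021975", "991234567"))  # False (many differences)
```

Tolerating a bounded number of differing digits is what lets a field like a Social Security Number still agree despite a single keying error.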
Exercise 8-1:
Undup Match
1. Define the output file
2. Define the Match Stage
3. Define the pass
• Choose blocking fields
4. Choose fields to compare and comparison method
5. Build the Match Extract
6. Create the Pass
Match Output Files

Match Extract             Contains the raw match results including the
                          WEIGHT, TYPE of match and SETID of records

Match Report              Includes matched records and summary statistics

Match Statistics Report   Contains the histogram, tables of weights and
                          summary statistics
Match Extract
SETID | TYPE |PASS| WEIGHT|ALL_OF_THE_DATA

393 | XA | 1 | 55.32 | MICHAEL F DOHERTY


393 | DA | 1 | 41.36 | MICHAEL F DOUGHERTY

468 | XA | 1 | 50.40 | EUGENE B BOROWITZ


468 | DA | 1 | 24.01 | BOROWITZ FAMILY TRUST
468 | DA | 1 | 47.26 | GENE BOROWITZ

520 | XA | 1 | 52.75 | FRAN X GEROSA


520 | DA | 1 | 40.95 | FRANCIS XAVIER GEROSA
520 | DA | 1 | 52.75 | FRANCIS X GEROSA
520 | DA | 1 | 41.22 | FRANK X GEROSA

1035 | RA | 1 | DARRYL F LINDBERG


Custom Extract Specification

MOVE @SET
MOVE " "
MOVE @TYPE
MOVE @PASS
MOVE " "
MOVE @WGT
MOVE " "
MOVEALL OF A

 This is an example of a common match extract specification
 It should match the output file defined in the previous exercise
 The data is moved to the output file according to these commands

Output columns:
SetID 1-9 | space 10 | TYPE 11-12 | PASS 13 | space 14 | WEIGHT 15-21 | space 22 | Rest of Fields 23+


Exercise 8-2:
Custom Match Extract
1. Select Extract Type
2. Select Output File
3. Enter Extract commands
Match Improvement Strategy

1. Set critical values for important fields


2. Review calculated weights
 Adjust weights using weight overrides
3. Set cutoffs
4. Add additional passes
Critical Fields

 Used to identify fields that must agree in order for


records to be linked
 Critical – Fields values must agree exactly or the records
cannot be linked (considered a match)
 Critical Missing OK – Field values must agree exactly on
values not considered “missing values”
 QualityStage feature: VARTYPE
Weight Overrides

 Allows you to adjust both the agreement and/or


disagreement weights for specific situations
 Add to calculated weight
 Replace weight
Exercise 8-3:
Critical Vartypes
1. Modify the Stage
2. Modify the Pass
3. Add additional Match fields
4. Re-run the Match Job
5. Review Results
Cutoffs
 There are two cutoffs
 Match cutoff (high cutoff)
 Clerical cutoff (low cutoff)
 Records with a weight equal to or above the Match cutoff are
considered matches
 Records with a weight below the low cutoff are not matches
 Records with a weight greater than or equal to the low cutoff and
less than the high cutoff are considered clerical records for
manual review
 Cutoffs can be set at the same value eliminating clerical records
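The partitioning the bullets describe can be sketched as follows; the cutoff values here are hypothetical.

```python
MATCH_CUTOFF = 30.0     # high cutoff
CLERICAL_CUTOFF = 20.0  # low cutoff

def classify(weight: float) -> str:
    """Assign a scored record pair to one of the three bands."""
    if weight >= MATCH_CUTOFF:
        return "match"
    if weight >= CLERICAL_CUTOFF:
        return "clerical"       # held for manual review
    return "non-match"

print(classify(38.65))  # match
print(classify(25.81))  # clerical
print(classify(14.45))  # non-match
```

Setting MATCH_CUTOFF and CLERICAL_CUTOFF to the same value empties the clerical band, as the last bullet notes.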
Setting the Match Cut-off

              Weights   Data fields

Definite      27.82     PO BOX 930202
Match         27.82     PO BOX 930202
              27.82     PO BOX 930202

Definite      38.65     35 COLLIER RD NW STE 610
Match         38.65     35 COLLIER RD NW STE 610

Questionable  25.81     928 S 1ST ST
Match         14.45     S 1ST ST
Exercise 8-4:
Set Match Cutoffs
1. Modify the Match Stage
2. Modify Pass 1
3. Set Cutoffs
4. Re-run the Match Job
5. Review Results
Multiple Match Passes

 Additional passes are helpful in overcoming data errors


and missing values in block fields
 You should always create at least two match passes
 Change blocking strategies for each pass
Example: Multiple Match Passes
Pass Weights Data fields

1 26.31 JASON BIRCH 1350 WALTON WAY 30901


1 26.31 JASON BIRSH 1350 WALTON WAY 30901

1 20.42 JOHN SMITH 2047 PRINCE AVE 30604


1 10.83 MARY SMITH 2047 PRINCE AVE 30604

1 RES A JOHN SMITH P.O. BOX 123 30604

2 20.42 JOHN SMITH 2047 PRINCE AVE 30604


2 10.19 JOHN SMITH P.O. BOX 123 30604

 Pass 1 blocked on street name


 Pass 2 found additional matched records in which the street
name was different but the names were the same
Exercise 8-5:
Add Match Pass 2
1. Modify the Match Stage
2. Add a new Pass
3. Choose Block Fields
4. Choose Match Fields
5. Run Job
6. Review Results
Module Summary
 Three types of matches
 Undup
 Match
 Geomatch
 Block to group together like records
 Only like records are compared adding computational efficiency
 Over 24 match comparisons
 Critical fields
 Match cutoffs
 Multiple passes
Survive
Module Objectives

 Describe where Survive is in the Data Re-engineering


Methodology
 Identify Survive techniques
 Describe implementation options
 Define Survive rules
 Build Survive stage
Data Re-engineering Methods

Data Quality Assessment (DQA): analyze free-form fields
Data Re-Engineering (DRE):
I. Investigation   II. Standardize   III. Match   IV. Survive
High-Level DFD

[Data flow diagram: AUTO, HOME and LIFE source files flow through the steps below]

 Investigate / Assess Data Quality
 Standardize Country
 Select US Data for further processing / Reject NON US Data
 Pre-Process US Data
 Condition Name, Address and Area
 Apply User Overrides
 Investigate Conditioned Results
 Append Data to a common format
 Add Unique Key
 Identify Duplicate Customer Records
 Survive the Best Customer Record
Survive Stage

 Point-and-click creation of business rules to determine


“surviving” data – user decides how to survive data
 Performed at record or field level – very flexible
 Creates a single, consolidated record containing the
“best-of-breed” data
 Cross-populates best available data
 Creates a cross-reference key
 Provides consolidated view of the data
Survive Example
Survive Input (Match Output)
Group Legacy First Middle Last No. Dir. Str. Name Type Unit No.
1 D150 Bob Dixon 1500 SE ROSS CLARK CIR
1 A1367 Robert Dickson 1500 ROSS CLARK CIR

23 D689 William A Obrian 5901 SW 74TH ST STE 202


23 A436 Billy Alex O’Brian 5901 SW 74TH ST
23 D352 William Obrian 5901 74 ST # 202

Survived Consolidated Output


Group Legacy First Middle Last No. Dir. Str. Name Type Unit No.
1 D150 Robert Dickson 1500 SE ROSS CLARK CIR

23 D689 William Alex O’Brian 5901 SW 74TH ST STE 202

Cross-Reference File
Group Legacy
1 D150
1 A1367

23 D689
23 A436
23 D352
Survive Rules

 A rule contains a condition and a set of target fields


 When the condition is met the field becomes a candidate for
the “best”
 All records in a group are tested against the condition
 The “best” populates the target fields
 Multiple targets are permitted for the same rule
Survive Rules
 Custom Rule
 Build your own logical expression
 Comparison (=, !=, <, > ,<=, >=)
 Logical (and, or, not)
 Indicate the current and best records with the following
notation
 c.field indicates the current
 b.field indicates the best
 Parentheses ( ) can be used for grouping complex conditions
 String literals are enclosed in double quotation marks, such as
“MARS”.
 A semicolon (;) terminates a rule.
Building Survive Rules
 Survive Rules Definition
screen lets you easily
build, delete and manage
survivor rules
Survive Techniques

 Pre-defined Techniques
 Source
 Recency
 Frequency
 Most complete (longest string)
 User-specified logic
Target Fields

 Fields you want to write to the output file


 Populated based on meeting the conditions of the
survivor rule(s)
 Fields not listed as targets are excluded from the output
file
 May have multiple targets for each rule
Example: Complex Survive Rule
 The following rule states that FIELD3 of the current record
should be retained if the field contains five or more characters
and FIELD1 has any contents.
 The prefix b. indicates the current “best” record
 The prefix c. indicates the current record being tested against the
survivor rule

FIELD3: (SIZEOF (TRIM c.FIELD3) >= 5) AND (SIZEOF (TRIM c.FIELD1) > 0) ;
TARGET  CONDITION
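The evaluation of such a rule over a match group can be sketched in plain code. This is an illustration of the idea, not QualityStage's rule engine; the group data is hypothetical.

```python
def survive_field3(group):
    """Mimics: FIELD3 survives if len(TRIM c.FIELD3) >= 5 and c.FIELD1 is non-empty."""
    best = None
    for rec in group:                      # rec plays the role of c.
        f1 = rec.get("FIELD1", "").strip()
        f3 = rec.get("FIELD3", "").strip()
        if len(f3) >= 5 and len(f1) > 0:   # the rule's condition
            best = f3                      # passing value becomes the "best" candidate
    return best

group = [
    {"FIELD1": "D150",  "FIELD3": "CIR"},          # too short, skipped
    {"FIELD1": "",      "FIELD3": "CIRCLE ROAD"},  # FIELD1 empty, skipped
    {"FIELD1": "A1367", "FIELD3": "CLARK CIRCLE"}, # passes the condition
]
print(survive_field3(group))  # CLARK CIRCLE
```

Every record in the group is tested against the condition, and the target field is populated from a record that satisfied it, matching the rule semantics described above.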
Exercise 9-1:
Survive the Best Customer Record
1. Define the output file
2. Define Survive stage
3. Choose target fields
4. Define Survive rules
5. Deploy and run
6. Review results
Module Summary

 Consolidate or survive the best record by choosing the


best record or best field from multiple records
 Use pre-defined techniques or build your own
 May use multiple rules