Real-World Data Is Dirty

Data Cleansing and the Merge/Purge Problem
Hernandez & Stolfo: Columbia University - 1998

Class Presentation by Rhonda Kost, 06.April 2004

TOPICS
Introduction A Basic Data Cleansing Solution Test & Real World Results Incremental Merge Purge w/ New Data Conclusion Recap

rmk 06.April.2004

Introduction

The problem:
Some corporations acquire large amounts of information every month The data is stored in many large databases (DB) These databases may be heterogeneous 

Variations in schema

The data may be represented differently across the various datasets Data in these DB may simply be inaccurate
rmk 06.April.2004

Requirement of the analysis
The data mining needs to be done 
 

Quickly Efficiently Accurately

rmk 06.April.2004

Examples of real-world applications
Credit card companies 


Assess risk of potential new customers Find false identities

Match disparate records concerning a customer 


Mass Marketing companies Government agencies

rmk 06.April.2004

A Basic Data Cleansing Solution

Duplicate Elimination
Sorted-Neighborhood Method (SNM) This is done in three phases 
 

Create a Key for each record Sort records on this key Merge/Purge records

rmk 06.April.2004

SNM: Create key
Compute a key for each record by extracting relevant fields or portions of fields Example:
First Sal Last Stolfo Address 123 First Street ID 45678987 Key STLSAL123FRST456

rmk 06.April.2004

SNM: Sort Data
Sort the records in the data list using the key in step 1 This can be very time consuming 


O(NlogN) for a good algorithm, O(N2) for a bad algorithm

rmk 06.April.2004

SNM: Merge records
Move a fixed size window through the sequential list of records. This limits the comparisons to the records in the window

rmk 06.April.2004

SNM: Considerations
What is the optimal window size while 


Maximizing accuracy Minimizing computational cost

Execution time for large DB will be bound by 


Disk I/O Number of passes over the data set

rmk 06.April.2004

Selection of Keys
The effectiveness of the SNM highly depends on the key selected to sort the records A key is defined to be a sequence of a subset of attributes Keys must provide sufficient discriminating power
rmk 06.April.2004

Example of Records and Keys
First Sal Sal Sal Sal Last Stolfo Stolfo Stolpho Stiles Address 123 First Street 123 First Street 123 First Street 123 Forest Street ID 45678987 45678987 45678987 45654321 Key STLSAL123FRST456 STLSAL123FRST456 STLSAL123FRST456 STLSAL123FRST456

rmk 06.April.2004

Equational Theory
The comparison during the merge phase is an inferential process Compares much more information than simply the key The more information there is, the better inferences can be made

rmk 06.April.2004

Equational Theory - Example
Two names are spelled nearly identically and have the same address 

It may be inferred that they are the same person

Two social security numbers are the same but the names and addresses are totally different 


Could be the same person who moved Could be two different people and there is an error in the social security number

rmk 06.April.2004

A simplified rule in English
Given two records, r1 and r2 IF the last name of r1 equals the last name of r2, AND the first names differ slightly, AND the address of r1 equals the address of r2 THEN r1 is equivalent to r2

rmk 06.April.2004

The distance function
A distance function is used to compare pieces of data (usually text) Apply distance function to data that differ slightly Select a threshold to capture obvious typographical errors. 

Impacts number of successful matches and number of false positives
rmk 06.April.2004

Examples of matched records
SSN 334600443 334600443 525520001 525520001 0 0 789912345 879912345 879912345 879912345 Name (First, Initial, Last) Lisa Boardman Lisa Brown Ramon Bonilla Raymond Bonilla Diana D. Ambrosion Diana A. Dambrosion Kathi Kason Kathy Kason Kathy Kason Kathy Smith Address 144 Wars St. 144 Ward St. 38 Ward St. 38 Ward St. 40 Brik Church Av. 40 Brick Church Av. 48 North St. 48 North St. 48 North St. 48 North St.

rmk 06.April.2004

Building an equational theory
The process of creating a good equational theory is similar to the process of creating a good knowledgebase for an expert system In complex problems, an expert s assistance is needed to write the equational theory

rmk 06.April.2004

Transitive Closure
In general, no single pass (i.e. no single key) will be sufficient to catch all matching records An attribute that appears first in the key has higher discriminating power than those appearing after them 

If an employee has two records in a DB with SSN 193456782 and 913456782, it s unlikely they will fall under the same window

rmk 06.April.2004

Transitive Closure
To increase the number of similar records merged 


Widen the scanning window size, w Execute several independent runs of the SNM 
Use a different key each time  Use a relatively small window  Call this the Multi-Pass approach

rmk 06.April.2004

Transitive Closure
Each independent run of the Multi-Pass approach will produce a set of pairs of records Although one field in a record may be in error, another field may not Transitive closure can be applied to those pairs to be merged
rmk 06.April.2004

Multi-pass Matches
Pass 1 (Lastname discriminates)
KSNKAT48NRTH789 (Kathi Kason 789912345 ) KSNKAT48NRTH879 (Kathy Kason 879912345 )

Pass 2 (Firstname discriminates)
KATKSN48NRTH789 (Kathi Kason 789912345 ) KATKSN48NRTH879 (Kathy Kason 879912345 )

Pass 3 (Address discriminates)
48NRTH879KSNKAT (Kathy Kason 879912345 ) 48NRTH879SMTKAT (Kathy Smith 879912345 )
rmk 06.April.2004

Transitive Equality Example
IF A implies B AND B implies C THEN A implies C From example:
789912345 Kathi Kason 48 North St. (A) 879912345 Kathy Kason 48 North St. (B) 879912345 Kathy Smith 48 North St. (C)

rmk 06.April.2004

Test Results

Test Environment
Test data was created by a database generator 

Names are randomly chosen from a list of 63000 real names

The database generator provides a large number of parameters: 
 

size of the DB, percentage of duplicates, amount of error
rmk 06.April.2004

Correct Duplicate Detection

rmk 06.April.2004

Time for each run

rmk 06.April.2004

Accuracy for each run

rmk 06.April.2004

Real-World Test
Data was obtained from the Office of Children Administrative Research (OCAR) of the Department of Social and Health Services (State of Washington) OCAR s goals 


How long do children stay in foster care? How many different homes do children typically stay in?
rmk 06.April.2004

OCAR s Database
Most of OCAR s data is stored in one relation The DB contains 6,000,000 total records The DB grows by about 50,000 records per month

rmk 06.April.2004

Typical Problems in the DB
Names are frequently misspelled SSN or birthdays are either missing or clearly wrong Case number often changes when the child s family moves to another part of the state Some records use service provider names instead of the child s No reliable unique identifier
rmk 06.April.2004

OCAR Equational Theory
Keys for the independent runs 
 

Last Name, First Name, SSN, Case Number First Name, Last Name, SSN, Case Number Case Number, First Name, Last Name, SSN

rmk 06.April.2004

OCAR Results

rmk 06.April.2004

Incremental Merge/Purge w/ New Data

Incremental Merge/Purge
Lists are concatenated for first time processing Concatenating new data before reapplying the merge/purge process may be very expensive in both time and space An incremental merge/purge approach is needed: Prime Representatives method

rmk 06.April.2004

Prime-Representative: Definition
A set of records extracted from each cluster of records used to represent the information in the cluster The Cluster Centroid or base element of equivalence class

rmk 06.April.2004

Prime-Representative creation
Initially, no PR exists After the execution of the first merge/purge create clusters of similiar records Correct selection of PR from cluster impacts accuracy of results No PR can be the best selection for some clusters
rmk 06.April.2004

3 Strategies for Choosing PR
Random Sample 

Select a sample of records at random from each cluster Most recent elements entered in DB Choose the largest or more complete record
rmk 06.April.2004

N-Latest 

Syntactic 

Important Assumption
No data previously used to select each cluster s PR will be deleted 

Deleted records could require restructuring of clusters (expensive)

No changes in the rule-set will occur after the first increment of data is processed 

Substantial rule change could invalidate clusters.
rmk 06.April.2004

Results
Cumulative running time for the Incremental Merge/Purge algorithm is higher than the classic algorithm PR selection methodology could improve cumulative running time Total running time of the Incremental Merge/Purge algorithm is always smaller
rmk 06.April.2004

Conclusion

Cleansing of Data
Sorted-Neighborhood Method is expensive due to 


the sorting phase the need for large windows for high accuracy

Multiple passes with small windows followed by transitive closure improves accuracy and performance for level of accuracy 


increasing number of successful matches decreasing number of false positives
rmk 06.April.2004

Recap

2 major reasons merging large databases becomes a difficult problem: 


The databases are heterogeneous The identifiers or strings differ in how they are represented within each DB

rmk 06.April.2004

The 3 steps in SNM are: 
 

Creation of key(s) Sorting records on this key Merge/Purge records

rmk 06.April.2004

Prime representative - set of records from cluster considered to be representative of data contained in cluster 3 strategies for selecting a PR: 
 

Random Sample N-Latest Syntactic
rmk 06.April.2004

Questions:

rmk 06.April.2004

Sign up to vote on this title
UsefulNot useful