SCRUBBING
Why cleanse/scrub data?
“A company’s most important asset is information. A
corporation’s ability to compete, adapt, and grow in a
business climate of rapid change is dependent in large
measure on how well the company uses information
to make decisions.
Sharing information that isn’t clean and consolidated
to the fullest extent can substantially reduce the
effectiveness of a system of significant investment
and considerable pay-off potential.”
What is data cleansing?
• Data cleansing, or data cleaning, is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database.
• It aims to identify incomplete, incorrect, inaccurate, or irrelevant parts of the data so that the dirty or coarse data can be replaced, modified, or deleted.
• Data cleansing can occur within a single set of records, or between multiple sets of data that need to be merged or that will work together.
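The detection half of this definition can be sketched in a few lines of Python. This is only an illustration: the records, the field names, and the three rules below are assumptions made for the example, not part of the original material.

```python
import re

# Hypothetical records; field names and values are illustrative only.
records = [
    {"name": "Beth Parker", "state": "IL", "zip": "60633"},
    {"name": "", "state": "IL", "zip": "60633"},                # incomplete: no name
    {"name": "John Doe", "state": "Illinois", "zip": "6O633"},  # letter O in the ZIP
]

ZIP_RE = re.compile(r"\d{5}")  # assumed rule: a ZIP is exactly five digits


def find_problems(record):
    """Return a list of problems detected in one record (empty if clean)."""
    problems = []
    if not record["name"].strip():
        problems.append("missing name")
    if not ZIP_RE.fullmatch(record["zip"]):
        problems.append("invalid zip")
    if len(record["state"]) != 2:
        problems.append("state not a 2-letter code")
    return problems


# Split the record set into clean records and records needing correction or removal.
clean, dirty = [], []
for rec in records:
    (dirty if find_problems(rec) else clean).append(rec)
```

Whether a dirty record is then corrected, modified, or deleted is a separate decision; the sketch only separates the two groups.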
Correcting
Here the address field contains an error: the address recorded in the source is not the correct address, so we correct it.
Standardizing
Here we prefix every female name with Ms. and apply the standard format for job title and street.
For the first name, we add a new field, 1st Name Match Standards, to capture the possible original names (useful, for example, when building a criminal database in which an alleged criminal uses an alias instead of a real name).
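A minimal Python sketch of this standardizing step follows. The lookup tables, the `gender` field, and the raw input record are hypothetical; a real system would load its standards from maintained reference data.

```python
# Hypothetical lookup tables, keyed by lowercased input without trailing periods.
TITLE_STANDARDS = {"sales manager": "Sales Mgr.", "sales mgr": "Sales Mgr."}
STREET_STANDARDS = {"south": "S.", "drive": "Dr.", "street": "St."}
NAME_MATCH_STANDARDS = {"beth": ["Elizabeth", "Bethany", "Bethel"]}


def standardize(record):
    """Return a copy of the record with standard prefix, title, and street forms."""
    out = dict(record)

    # Prefix females with "Ms." (the gender field is an assumption of this sketch).
    if record.get("gender") == "F":
        out["pre_name"] = "Ms."

    # Map the job title to its standard form, if one is known.
    title = record.get("title", "")
    out["title"] = TITLE_STANDARDS.get(title.lower().rstrip("."), title)

    # Standardize the street word by word.
    words = record.get("street", "").split()
    out["street"] = " ".join(
        STREET_STANDARDS.get(w.lower().rstrip("."), w) for w in words
    )

    # New field: possible original names for the given first name.
    first = record.get("first_name", "").lower()
    out["first_name_match_standards"] = NAME_MATCH_STANDARDS.get(first, [])
    return out


raw = {
    "gender": "F",
    "first_name": "Beth",
    "title": "Sales Manager",
    "street": "South Butler Drive",
}
std = standardize(raw)
```

Keeping the standards in plain lookup tables means new abbreviations or name variants can be added without touching the function.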
Matching
Searching and matching records within and
across the parsed, corrected and standardized
data based on predefined business rules to
eliminate duplications.
Matching
Field                      Corrected Data (Data Source #1)   Corrected Data (Data Source #2)
Pre-name                   Ms.                               Ms.
First Name                 Beth                              Elizabeth
1st Name Match Standards   Elizabeth, Bethany, Bethel        Beth, Bethany, Bethel
Middle Name                Christine                         Christine
Last Name                  Parker                            Parker-Lewis
Title                      Sales Mgr.                        (blank)
Firm                       Regional Port Authority           Regional Port Authority
Location                   Federal Building                  Federal Building
Number                     12800                             12800
Street                     S. Butler Dr.                     S. Butler Dr., Suite 2
City                       Chicago                           Chicago
State                      IL                                IL
Zip                        60633                             60633
Zip+Four                   2398                              2398
Phone                      (not listed)                      708-555-1234
Fax                        (not listed)                      708-555-5678
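One way to express a matching rule for these two records is sketched below in Python. The rule itself (shared name variant, compatible last name, same firm, street number, and ZIP) is an assumed example of a "predefined business rule", not one stated in the original; the dictionary keys are likewise hypothetical.

```python
def name_variants(record):
    """The first name plus its match standards, lowercased."""
    variants = {record["first_name"].lower()}
    variants.update(n.lower() for n in record["match_standards"])
    return variants


def last_names_compatible(a, b):
    """Treat 'Parker' and 'Parker-Lewis' as compatible last names."""
    parts_a = set(a.lower().split("-"))
    parts_b = set(b.lower().split("-"))
    return parts_a <= parts_b or parts_b <= parts_a


def is_match(r1, r2):
    """Assumed rule: same person (by name variants) at the same firm and address."""
    same_person = bool(name_variants(r1) & name_variants(r2)) and (
        last_names_compatible(r1["last_name"], r2["last_name"])
    )
    same_place = (r1["firm"], r1["number"], r1["zip"]) == (
        r2["firm"], r2["number"], r2["zip"]
    )
    return same_person and same_place


source1 = {
    "first_name": "Beth",
    "match_standards": ["Elizabeth", "Bethany", "Bethel"],
    "last_name": "Parker",
    "firm": "Regional Port Authority",
    "number": "12800",
    "zip": "60633",
}
source2 = {
    "first_name": "Elizabeth",
    "match_standards": ["Beth", "Bethany", "Bethel"],
    "last_name": "Parker-Lewis",
    "firm": "Regional Port Authority",
    "number": "12800",
    "zip": "60633",
}

matched = is_match(source1, source2)
```

The 1st Name Match Standards field is what lets "Beth" and "Elizabeth" be recognized as the same person even though the first names differ.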
Consolidating
Analyzing and identifying relationships between
matched records and consolidating/merging them
into ONE representation.
Consolidating
Consolidated Data (merged from Corrected Data, Data Source #1 and Data Source #2)
Name: Ms. Beth (Elizabeth) Christine Parker-Lewis
Title: Sales Mgr.
Firm: Regional Port Authority
Location: Federal Building
Address: 12800 S. Butler Dr., Suite 2
Chicago, IL 60633-2398
Phone: 708-555-1234
Fax: 708-555-5678
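The merge into one representation can be sketched as follows. The survivorship rule used here (keep the longer, more complete value, and keep both first names) is an assumption chosen because it reproduces the consolidated record in this example; real consolidation rules would be defined by the business. The dictionary keys are hypothetical.

```python
def consolidate(r1, r2):
    """Merge two matched records into one, field by field."""
    merged = {}
    # Visit every field that appears in either record, source 1 order first.
    fields = list(r1) + [f for f in r2 if f not in r1]
    for field in fields:
        v1, v2 = r1.get(field, ""), r2.get(field, "")
        # Assumed rule: the longer value is the more complete one; ties go to source 1.
        merged[field] = v2 if len(v2) > len(v1) else v1

    # Keep both first names when they differ, e.g. "Beth (Elizabeth)".
    if r1["first_name"] != r2["first_name"]:
        merged["first_name"] = f"{r1['first_name']} ({r2['first_name']})"
    return merged


source1 = {
    "first_name": "Beth",
    "last_name": "Parker",
    "title": "Sales Mgr.",
    "street": "S. Butler Dr.",
}
source2 = {
    "first_name": "Elizabeth",
    "last_name": "Parker-Lewis",
    "title": "",
    "street": "S. Butler Dr., Suite 2",
    "phone": "708-555-1234",
    "fax": "708-555-5678",
}

merged = consolidate(source1, source2)
```

Here the title survives from Data Source #1, while the fuller last name, the suite number, and the phone and fax numbers come from Data Source #2.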
Recommended Best Practices
1. Use metadata to document rules
2. Determine data cleansing schedule
3. Build quality into new and existing systems
CONCLUSION