
DATA CLEANSING / SCRUBBING
Why cleanse/scrub data?
“A company’s most important asset is information. A
corporation’s ability to compete, adapt, and grow in a
business climate of rapid change is dependent in large
measure on how well the company uses information
to make decisions.
Sharing information that isn’t clean and consolidated
to the fullest extent can substantially reduce the
effectiveness of a system of significant investment
and considerable pay-off potential.”
What is data cleansing?
Data cleansing, or data cleaning, is the process
of detecting and correcting (or removing)
corrupt or inaccurate records from a record
set, table, or database.
It aims to identify incomplete, incorrect,
inaccurate, or irrelevant parts of the data with
the intention to replace, modify, or delete the
dirty or coarse data.
• Data cleansing can occur within a single set of
records, or between multiple sets of data which
need to be merged, or which will work together.
• Typos and spelling errors are corrected,
mislabeled data is properly labeled and filed, and
incomplete or missing entries are completed.
• In more complex operations, data cleansing can
be performed by computer programs. These data
cleansing programs can check the data with a
variety of rules and procedures decided upon by
the user.
• The goal of data cleansing is not just to clean up
the data in a database but also to bring
consistency to different sets of data that have
been merged from separate databases.
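As a rough illustration of a program that checks data against user-defined rules, the sketch below validates a record field by field. The rule set, the field names (`state`, `zip`, `phone`), and the `find_violations` helper are assumptions made for this example, not part of any particular cleansing tool.

```python
import re

# Hypothetical user-defined rules: each field maps to a predicate
# that returns True when the value is acceptable.
RULES = {
    "state": lambda v: bool(re.fullmatch(r"[A-Z]{2}", v or "")),
    "zip": lambda v: bool(re.fullmatch(r"\d{5}", v or "")),
    "phone": lambda v: bool(re.fullmatch(r"\d{3}-\d{3}-\d{4}", v or "")),
}


def find_violations(record):
    """Return the sorted names of fields that fail their rule."""
    return sorted(field for field, is_valid in RULES.items()
                  if not is_valid(record.get(field)))


clean = {"state": "IL", "zip": "60633", "phone": "708-555-1234"}
dirty = {"state": "Illinois", "zip": "6063", "phone": None}

print(find_violations(clean))  # []
print(find_violations(dirty))  # ['phone', 'state', 'zip']
```

Keeping the rules in a plain dictionary means a user can add or change a rule without touching the checking logic.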
When do we say we have messy data?
Dummy Values
Absence of Data
Multipurpose Fields
Cryptic Data
Contradicting Data
Inappropriate Use of Address Lines
Violation of Business Rules
Reused Primary Keys
Non-Unique Identifiers
Data Integration Problems
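Some of the symptoms listed above can be detected mechanically. The sketch below flags dummy values, absent data, and non-unique identifiers in a small list of records. The placeholder set `DUMMY_VALUES`, the `profile` helper, and the sample records are assumptions chosen for illustration only.

```python
from collections import Counter

# Assumed placeholder strings that stand in for "no real value".
DUMMY_VALUES = {"", "N/A", "NA", "UNKNOWN", "999-99-9999", "00000"}


def profile(records, key_field):
    """Report (row, field) pairs holding dummy or absent values,
    plus identifiers that appear on more than one record."""
    issues = []
    for row, record in enumerate(records):
        for field, value in record.items():
            if value is None or str(value).strip().upper() in DUMMY_VALUES:
                issues.append((row, field))
    counts = Counter(record.get(key_field) for record in records)
    duplicates = sorted(k for k, n in counts.items()
                        if k is not None and n > 1)
    return issues, duplicates


records = [
    {"id": "C001", "name": "Beth Parker", "zip": "60633"},
    {"id": "C002", "name": "N/A", "zip": "00000"},
    {"id": "C001", "name": "Elizabeth Parker-Lewis", "zip": None},
]

issues, duplicates = profile(records, "id")
print(issues)      # [(1, 'name'), (1, 'zip'), (2, 'zip')]
print(duplicates)  # ['C001']
```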
Problems with Messy Data
Messy Data Originating from One Source Database
Messy Data Originating from Multiple Database Sources
[Figures: examples of messy data sets (.csv)]
What to do with messy data?
Steps in Data Cleansing
Parsing
Correcting
Standardizing
Matching
Consolidating
Parsing
Parsing locates and identifies individual
data elements in the source files and then
isolates these data elements in the target
files.
Parsing
Input Data from Source File
Beth Christine Parker, SLS MGR
Regional Port Authority
Federal Building
12800 Lake Calumet
Hedgewisch, IL
Parsed Data in Target File
First Name: Beth
Middle Name: Christine
Last Name: Parker
Title: SLS MGR
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: Lake Calumet
City: Hedgewisch
State: IL
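The parsing example above can be sketched in a few lines of code. This is a minimal sketch that assumes the source block always has exactly five lines in the order shown and a three-part name; the `parse_contact` function is hypothetical, and real source files would need far more tolerant parsing.

```python
def parse_contact(raw):
    """Split a five-line contact block into named fields.

    Assumes a fixed layout: name and title, firm, location,
    street line, then city and state.
    """
    name_title, firm, location, street_line, city_state = [
        line.strip() for line in raw.strip().splitlines()
    ]
    name, title = [part.strip() for part in name_title.split(",", 1)]
    first, middle, last = name.split()
    number, street = street_line.split(" ", 1)
    city, state = [part.strip() for part in city_state.split(",", 1)]
    return {
        "First Name": first,
        "Middle Name": middle,
        "Last Name": last,
        "Title": title,
        "Firm": firm,
        "Location": location,
        "Number": number,
        "Street": street,
        "City": city,
        "State": state,
    }


raw = """Beth Christine Parker, SLS MGR
Regional Port Authority
Federal Building
12800 Lake Calumet
Hedgewisch, IL"""

parsed = parse_contact(raw)
for field, value in parsed.items():
    print(f"{field}: {value}")
```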
Correcting
Corrects parsed individual data components
using sophisticated data algorithms and/or
secondary data sources.
Correcting
Parsed Data
First Name: Beth
Middle Name: Christine
Last Name: Parker
Title: SLS MGR
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: Lake Calumet
City: Hedgewisch
State: IL
Corrected Data
First Name: Beth
Middle Name: Christine
Last Name: Parker
Title: SLS MGR
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: South Butler Drive
City: Chicago
State: IL
Zip: 60633
Zip+Four: 2398

Here the address field contains an error (for some reason, the address recorded in
the source is not the correct address), so we correct it: the street and city are
replaced, and the Zip and Zip+Four values are added.
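One way to picture correction against a secondary data source is a lookup into a table of verified addresses. The sketch below assumes such a table exists and is keyed by firm and location; `VERIFIED_ADDRESSES` and `correct_address` are hypothetical names, and the values are taken from the example above.

```python
# Hypothetical secondary data source: verified addresses keyed by
# (firm, location). In practice this would be an external reference file.
VERIFIED_ADDRESSES = {
    ("Regional Port Authority", "Federal Building"): {
        "Number": "12800",
        "Street": "South Butler Drive",
        "City": "Chicago",
        "State": "IL",
        "Zip": "60633",
        "Zip+Four": "2398",
    },
}


def correct_address(parsed):
    """Overwrite address fields with verified values when a match exists;
    otherwise return an unchanged copy of the record."""
    key = (parsed.get("Firm"), parsed.get("Location"))
    verified = VERIFIED_ADDRESSES.get(key)
    if verified is None:
        return dict(parsed)
    return {**parsed, **verified}


parsed = {
    "First Name": "Beth", "Last Name": "Parker",
    "Firm": "Regional Port Authority", "Location": "Federal Building",
    "Number": "12800", "Street": "Lake Calumet",
    "City": "Hedgewisch", "State": "IL",
}

corrected = correct_address(parsed)
print(corrected["Street"], "/", corrected["City"], "/", corrected["Zip"])
# South Butler Drive / Chicago / 60633
```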
Standardizing
Standardizing applies conversion routines to
transform data into its preferred (and
consistent) format using both standard and
custom business rules.
Standardizing
Corrected Data (before standardizing)
First Name: Beth
Middle Name: Christine
Last Name: Parker
Title: SLS MGR
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: South Butler Drive
City: Chicago
State: IL
Zip: 60633
Zip+Four: 2398
Corrected Data (after standardizing)
Pre-name: Ms.
First Name: Beth
1st Name Match Standards: Elizabeth, Bethany, Bethel
Middle Name: Christine
Last Name: Parker
Title: Sales Mgr.
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: S. Butler Dr.
City: Chicago
State: IL
Zip: 60633
Zip+Four: 2398

Here we prefix female names with Ms. and apply the standard format for job title and street.
For the first name, we add a new field, 1st Name Match Standards, to capture possible original
names (useful, for example, when building a criminal database in which an alleged criminal
uses an alias instead of a real name).
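The standardizing step in the example above amounts to applying lookup tables to individual fields. The sketch below shows that idea with three small tables; the table contents, the `standardize` function, and the choice to pass the pre-name in as an argument are all assumptions for illustration, since the slides do not say how the rules are stored.

```python
# Hypothetical conversion tables standing in for business rules.
TITLE_STANDARDS = {"SLS MGR": "Sales Mgr."}
STREET_ABBREVIATIONS = {"South": "S.", "Drive": "Dr."}
NAME_MATCH_STANDARDS = {"Beth": ["Elizabeth", "Bethany", "Bethel"]}


def standardize(record, pre_name=None):
    """Return a copy of the record with title and street in standard
    form, plus the list of possible original first names."""
    out = dict(record)
    if pre_name:
        out["Pre-name"] = pre_name
    out["Title"] = TITLE_STANDARDS.get(record["Title"], record["Title"])
    out["Street"] = " ".join(
        STREET_ABBREVIATIONS.get(word, word)
        for word in record["Street"].split()
    )
    out["1st Name Match Standards"] = NAME_MATCH_STANDARDS.get(
        record["First Name"], []
    )
    return out


corrected = {
    "First Name": "Beth", "Last Name": "Parker",
    "Title": "SLS MGR", "Street": "South Butler Drive",
}

standardized = standardize(corrected, pre_name="Ms.")
print(standardized["Title"])   # Sales Mgr.
print(standardized["Street"])  # S. Butler Dr.
```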
Matching
Searching and matching records within and
across the parsed, corrected and standardized
data based on predefined business rules to
eliminate duplications.
Matching
Corrected Data (Data Source #1)
Pre-name: Ms.
First Name: Beth
1st Name Match Standards: Elizabeth, Bethany, Bethel
Middle Name: Christine
Last Name: Parker
Title: Sales Mgr.
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: S. Butler Dr.
City: Chicago
State: IL
Zip: 60633
Zip+Four: 2398
Corrected Data (Data Source #2)
Pre-name: Ms.
First Name: Elizabeth
1st Name Match Standards: Beth, Bethany, Bethel
Middle Name: Christine
Last Name: Parker-Lewis
Title:
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: S. Butler Dr., Suite 2
City: Chicago
State: IL
Zip: 60633
Zip+Four: 2398
Phone: 708-555-1234
Fax: 708-555-5678
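A matching rule for the two records above might be sketched as follows. The rule itself (same firm, street number and Zip, overlapping first-name variants, and last names that are equal or differ only by a hyphenated addition) is an assumption made for this example; real business rules would be defined by the organization, and `is_match` is a hypothetical helper.

```python
def name_variants(record):
    """First name plus any alternative names recorded for matching."""
    return {record["First Name"], *record.get("1st Name Match Standards", [])}


def is_match(a, b):
    """Assumed rule: same firm, street number and Zip, overlapping
    first-name variants, and equal or hyphen-extended last names."""
    same_place = all(a.get(k) == b.get(k) for k in ("Firm", "Number", "Zip"))
    shared_first = bool(name_variants(a) & name_variants(b))
    last_a, last_b = a["Last Name"], b["Last Name"]
    related_last = (
        last_a == last_b
        or last_b.startswith(last_a + "-")
        or last_a.startswith(last_b + "-")
    )
    return same_place and shared_first and related_last


source_1 = {
    "First Name": "Beth",
    "1st Name Match Standards": ["Elizabeth", "Bethany", "Bethel"],
    "Last Name": "Parker", "Firm": "Regional Port Authority",
    "Number": "12800", "Zip": "60633",
}
source_2 = {
    "First Name": "Elizabeth",
    "1st Name Match Standards": ["Beth", "Bethany", "Bethel"],
    "Last Name": "Parker-Lewis", "Firm": "Regional Port Authority",
    "Number": "12800", "Zip": "60633",
}
other = {
    "First Name": "John", "Last Name": "Parker",
    "Firm": "Regional Port Authority", "Number": "12800", "Zip": "60633",
}

print(is_match(source_1, source_2))  # True
print(is_match(source_1, other))     # False
```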
Consolidating
Analyzing and identifying relationships between
matched records and consolidating/merging them
into ONE representation.
Consolidating
Inputs: Corrected Data (Data Source #1) and Corrected Data (Data Source #2)
Consolidated Data
Name: Ms. Beth (Elizabeth) Christine Parker-Lewis
Title: Sales Mgr.
Firm: Regional Port Authority
Location: Federal Building
Address: 12800 S. Butler Dr., Suite 2
Chicago, IL 60633-2398
Phone: 708-555-1234
Fax: 708-555-5678
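Consolidation needs a rule for deciding which value survives when two matched records disagree. The sketch below assumes a simple "most complete value wins" rule (the longer non-empty string is kept) and keeps both first names side by side, which reproduces the consolidated record shown above. The `consolidate` function and this survivorship rule are assumptions, not a method stated in the slides.

```python
def consolidate(primary, secondary):
    """Merge two matched records into one.

    Assumed rule: for each field keep the longer value, so blanks are
    filled and more complete entries win; differing first names are
    kept together as "Primary (Secondary)".
    """
    fields = list(primary) + [f for f in secondary if f not in primary]
    merged = {}
    for field in fields:
        a, b = primary.get(field, ""), secondary.get(field, "")
        merged[field] = b if len(b) > len(a) else a
    if primary["First Name"] != secondary["First Name"]:
        merged["First Name"] = (
            f'{primary["First Name"]} ({secondary["First Name"]})'
        )
    return merged


source_1 = {
    "First Name": "Beth", "Last Name": "Parker",
    "Title": "Sales Mgr.", "Street": "S. Butler Dr.",
}
source_2 = {
    "First Name": "Elizabeth", "Last Name": "Parker-Lewis",
    "Title": "", "Street": "S. Butler Dr., Suite 2",
    "Phone": "708-555-1234", "Fax": "708-555-5678",
}

merged = consolidate(source_1, source_2)
for field, value in merged.items():
    print(f"{field}: {value}")
```

A length-based rule is crude: it would also prefer a longer misspelling, so a production system would more likely rank sources by trust or recency.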
Recommended Best Practices
1. Use metadata to document rules
2. Determine data cleansing schedule
3. Build quality into new and existing systems
CONCLUSION
Hence we conclude that data cleansing is not
only an effective tool for removing unwanted,
“dirty” data, but also the medium to make data
in our databases and systems concise, selective
and appropriate.
References
en.wikipedia.org/wiki/Data_cleansing
www.webopedia.com/TERM/D/data_cleansing.html
https://www.slideshare.net/ramakantsoni/role-of-data-cleaning-rk/13
https://www.ibm.com/developerworks/library/ba-cleanse-process-visualize-data-set-1/index.html
https://www.slideshare.net/Salesforce/best-practices-for-keeping-your-data-clean
