
The Need For Data Quality

 Difficulty in decision making
 Time delays in operations
 Organizational mistrust
 Data ownership conflicts
 Customer attrition
 Costs associated with:
 error detection
 error rework
 customer service
 fixing customer problems

Poor Data Quality Impacts The Bottom Line
Six Steps To Data Quality
1. Understand Information Flow In Organization
    Identify authoritative data sources
    Interview Employees & Customers
    Data Entry Points
2. Identify Potential Problem Areas & Assess Impact
    Use business rule discovery tools to identify data with inconsistent, missing, incomplete, duplicate or incorrect values
    Cost of bad data
3. Measure Quality Of Data
    Ratio of bad records to good records
4. Clean & Load Data
    Use data cleansing tools to clean data at the source
    Load only clean data into the data warehouse
5. Continuous Monitoring
    Schedule Periodic Cleansing of Source Data
6. Identify Areas of Improvement
    Identify & Correct Cause of Defects
    Refine data capture mechanisms at source
    Educate users on importance of DQ
Data Quality Analysis
 Identify Inconsistencies
 Codification differences
 Multiple database entries for the same entity
 Out-of-range or missing values
 Frequency Distribution
 Uniqueness of keys
 Referential Integrity
 Compliance with other business rules

Examine Source Systems For Inconsistencies
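These checks can be scripted directly against the source data. Below is a minimal sketch over in-memory records; the field names, the gender code set, and the sample rows are illustrative assumptions.

```python
# Minimal data-quality checks over illustrative in-memory records.
from collections import Counter

customers = [
    {"cust_id": 1, "gender": "M", "age": 34},
    {"cust_id": 2, "gender": "Male", "age": 131},   # codification difference, out-of-range age
    {"cust_id": 2, "gender": "F", "age": 28},       # duplicate key
]
orders = [{"order_id": 10, "cust_id": 2}, {"order_id": 11, "cust_id": 99}]

# Uniqueness of keys
dup_keys = [k for k, n in Counter(c["cust_id"] for c in customers).items() if n > 1]

# Out-of-range or missing values
bad_age = [c for c in customers if c.get("age") is None or not (0 <= c["age"] <= 120)]

# Codification differences: values outside the agreed code set
bad_codes = [c for c in customers if c["gender"] not in {"M", "F"}]

# Referential integrity: every order must reference an existing customer
known_ids = {c["cust_id"] for c in customers}
orphans = [o for o in orders if o["cust_id"] not in known_ids]

print(dup_keys, bad_age, bad_codes, orphans)
```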
Tools In The Market
 Business Rule Discovery Tools
 Integrity Data Reengineering Tool from Vality Technology
 Trillium Software System from Harte-Hanks Data Technologies
 Migration Architect from DB Star
 WizRule and WizWhy from WizSoft Inc
 Data Reengineering & Cleansing Tools
 Prism from Prism Solutions
 Carlton Pureview from Oracle
 ETI-Extract from Evolutionary Technologies
 PowerMart from Informatica Corp
 Sagent Data Mart from Sagent Technology
Tools In The Market
 Name & Address Cleansing Tools
 Centrus Suite from Sagent
 I.d.centric from First Logic
 Quality Defect Prevention Tools
 Centrus Suite from Sagent
 I.d.centric from First Logic
 Data Quality Assessment Tools
 Migration Architect, Evoke Axio from Evoke Software
 WizRule from WizSoft
Why Data Preprocessing?
Data in the real world is dirty
incomplete: lacking attribute values, lacking
certain attributes of interest, or containing only
aggregate data
 e.g., occupation=“ ”
noisy: containing errors or outliers
 e.g., Salary=“-10”
inconsistent: containing discrepancies in codes or
names
 e.g., Age=“42” Birthday=“03/07/1997”
 e.g., Was rating “1,2,3”, now rating “A, B, C”
 e.g., discrepancy between duplicate records
Forms of Data Preprocessing
Measuring the Dispersion of Data
 Quartiles, outliers and boxplots
 Quartiles: Q1 (25th percentile), Q3 (75th percentile)
 Inter-quartile range: IQR = Q3 − Q1
 Boxplot: ends of the box are the quartiles, the median is marked, and outliers are plotted individually
 Variance and standard deviation (sample: s, population: σ)
 Variance (algebraic, scalable computation):
   $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2 = \frac{1}{n-1}\Big[\sum_{i=1}^{n}x_i^2 - \frac{1}{n}\big(\sum_{i=1}^{n}x_i\big)^2\Big]$
   $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i-\mu)^2 = \frac{1}{N}\sum_{i=1}^{N}x_i^2 - \mu^2$
 Standard deviation s (or σ) is the square root of variance s² (or σ²)
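These measures can be computed directly with the Python standard library (3.8+ for statistics.quantiles); the price list below is the one used in the binning example later, and the 1.5 × IQR outlier rule is the usual boxplot convention.

```python
# Quartiles, IQR, boxplot-style outliers, variance and standard deviation.
import statistics

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

q1, median, q3 = statistics.quantiles(prices, n=4)      # 25th, 50th, 75th percentiles
iqr = q3 - q1
outliers = [x for x in prices if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]

s2 = statistics.variance(prices)       # sample variance, divides by n-1
sigma2 = statistics.pvariance(prices)  # population variance, divides by N
s = statistics.stdev(prices)           # square root of the sample variance

print(q1, median, q3, iqr, outliers, s2, sigma2, s)
```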
How to Handle Missing Data?
 Ignore the tuple: usually done when class label is missing.
 Fill in the missing value manually: tedious + infeasible?
 Fill it in automatically with
 a global constant : e.g., “unknown”, a new class?!
 the attribute mean
 the attribute mean for all samples belonging to the same class:
smarter
 the most probable value: use inference-based formula such as
Bayesian formula or decision tree
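A sketch of the automatic fill-in strategies, assuming a small in-memory table with an illustrative income attribute and class label.

```python
# Filling missing values: global constant, attribute mean, class-conditional mean.
import statistics

rows = [
    {"class": "yes", "income": 40}, {"class": "yes", "income": None},
    {"class": "no",  "income": 20}, {"class": "no",  "income": 30},
]

# 1) A global constant such as "unknown"
filled_const = [{**r, "income": r["income"] if r["income"] is not None else "unknown"}
                for r in rows]

# 2) The attribute mean over all non-missing values
mean_all = statistics.mean(r["income"] for r in rows if r["income"] is not None)

# 3) The attribute mean for samples of the same class (smarter)
def class_mean(label):
    return statistics.mean(r["income"] for r in rows
                           if r["class"] == label and r["income"] is not None)

filled_by_class = [{**r, "income": r["income"] if r["income"] is not None
                    else class_mean(r["class"])} for r in rows]

print(mean_all, filled_by_class)
```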
Noisy Data
 Noise: random error or variance in a measured
variable
 Incorrect attribute values may be due to
 faulty data collection instruments
 data entry problems
 data transmission problems
 technology limitation
 inconsistency in naming convention
 Other data problems which require data cleaning
 duplicate records
 incomplete data
 inconsistent data
How to Handle Noisy Data?
 Binning
 first sort data and partition into (equal-frequency) bins
 then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc.
 Regression
 smooth by fitting the data into regression functions
 Clustering
 detect and remove outliers
 Combined computer and human inspection
 detect suspicious values and check by human (e.g., deal with
possible outliers)
Binning Methods for Data Smoothing
 Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26,
28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
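The same worked example, sketched in code; rounding bin means to integers and breaking ties toward the lower boundary are minor implementation choices.

```python
# Equal-frequency binning with smoothing by bin means and by bin boundaries.
prices = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
depth = 4
bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]

def smooth_by_means(bin_values):
    mean = round(sum(bin_values) / len(bin_values))
    return [mean] * len(bin_values)

def smooth_by_boundaries(bin_values):
    lo, hi = bin_values[0], bin_values[-1]
    return [lo if x - lo <= hi - x else hi for x in bin_values]

print([smooth_by_means(b) for b in bins])        # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print([smooth_by_boundaries(b) for b in bins])   # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```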
Regression
[Figure: scatter plot with fitted regression line y = x + 1; a noisy value Y1 at X1 is replaced by its fitted value Y1' on the line]
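A minimal least-squares sketch of regression smoothing, using made-up data scattered around y = x + 1; each noisy y is replaced by its fitted value on the line.

```python
# Fit y = slope*x + intercept by least squares, then smooth y onto the line.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 2.9, 4.2, 4.8, 6.1]          # illustrative noisy data near y = x + 1

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

smoothed = [slope * x + intercept for x in xs]   # the Y1' for each X1
print(round(slope, 2), round(intercept, 2), [round(v, 2) for v in smoothed])
```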
Cluster Analysis
Data Cleaning as a Process
 Data discrepancy detection (causes include users not wanting to give details, outdated addresses, poorly designed forms, too many options for questions)
 Use any knowledge you have about the data, e.g., metadata (domain, range, dependency, distribution), or write your own scripts
 Check field overloading (e.g., 2004/12/25 vs. 25/12/2004)
 Check uniqueness rule, consecutive rule and null rule (zero, person refusing to provide, blanks)
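A minimal sketch of rule-driven discrepancy detection; the rule table, field names and sample records are illustrative assumptions, where real tools would take such rules from metadata.

```python
# Discrepancy detection from simple metadata rules: uniqueness, null and range checks.
records = [
    {"id": 1, "age": 34,   "postal": "560001"},
    {"id": 1, "age": 250,  "postal": ""},        # duplicate key, out-of-range age, blank postal
    {"id": 3, "age": None, "postal": "110001"},
]

rules = {
    "age":    {"min": 0, "max": 120, "nullable": False},
    "postal": {"nullable": False},
}

seen, problems = set(), []
for r in records:
    if r["id"] in seen:                                   # uniqueness rule
        problems.append((r["id"], "duplicate key"))
    seen.add(r["id"])
    for field, rule in rules.items():
        value = r.get(field)
        if value in (None, "") and not rule["nullable"]:  # null rule
            problems.append((r["id"], f"{field} missing"))
        elif isinstance(value, (int, float)) and not (
                rule.get("min", value) <= value <= rule.get("max", value)):
            problems.append((r["id"], f"{field} out of range"))   # domain/range rule

print(problems)
```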
Data cleaning contd.
 Use commercial tools
 Data scrubbing: use simple domain knowledge (e.g., postal code, spell-check) to detect errors and make corrections (uses parsing and fuzzy matching techniques)
 Data auditing: analyze data to discover rules and relationships and to detect violators (e.g., correlation and clustering to find outliers)
 ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a graphical user interface
 Integration of the two processes (www.control.cs.berkeley.edu.abc)
 Iterative and interactive (e.g., Potter's Wheel)
 Work in progress: declarative languages based on SQL for data cleaning
Data Integration
 Data integration:
 Combines data from multiple sources into a coherent store
 Schema integration: e.g., A.cust-id ≡ B.cust-#
 Integrate metadata from different sources
 Entity identification problem:
 Identify real world entities from multiple data sources, e.g., Bill
Clinton = William Clinton
 Detecting and resolving data value conflicts
 For the same real world entity, attribute values from different
sources are different
 Possible reasons: different representations, different scales,
e.g., metric vs. British units
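A small sketch of the three issues above: mapping B.cust-# onto A.cust-id (schema integration), recognizing that Bill and William refer to the same person (entity identification), and converting inches to centimetres (value conflict). The alias table, field names and sample values are assumptions for illustration.

```python
# Integrating two illustrative sources with differing schemas, name forms and units.
source_a = [{"cust_id": 7, "name": "William Clinton", "height_cm": 188}]
source_b = [{"cust_#": 7, "name": "Bill Clinton", "height_in": 74}]

NICKNAMES = {"bill": "william"}              # tiny alias table for entity identification

def canonical(name):
    first, *rest = name.lower().split()
    return " ".join([NICKNAMES.get(first, first)] + rest)

integrated = []
for b in source_b:
    row = {"cust_id": b["cust_#"],                       # schema integration: cust-# -> cust_id
           "name": b["name"],
           "height_cm": round(b["height_in"] * 2.54)}    # resolve metric vs. British units
    match = next((a for a in source_a
                  if a["cust_id"] == row["cust_id"]
                  and canonical(a["name"]) == canonical(row["name"])), None)
    row["same_entity_as_a"] = match is not None
    integrated.append(row)

print(integrated)
```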
ETL

Ralph Speaks
Technical Design Challenges Posed
By The Data Warehouse Evolution

• Timeliness
• Data Volumes
• Response Times
[Diagram: "The Big Picture!" showing legacy systems, B2C, B2B, CRM and other sources feeding BI]
Which Approach Do We Take?
Data Extraction and Preparation

 Stage I: Extract
 Stage II: Analyze, Clean and Transform
 Stage III: Data Movement and Load
 Periodic Refresh/Update
The ETL Process
 Access data dictionaries defining source files
 Build logical and physical data models for target data
 Survey existing systems to identify sources of data
 Specify business and technical rules for data extraction, conversion and transformation
 Perform data extraction and transformation
 Load target databases
[Diagram: source databases and data definitions feed a metadata repository (source and target metadata) and a data modeling tool; extraction rules are defined/coded, extract programs are generated and run, and raw data is cleaned and loaded into the warehouse (RDBMS/MDDB)]
The ETL Process
[Diagram: OLTP systems → Extract (Stage I) → Staging Area → Transform (Stage II) → Load (Stage III) → Data Warehouse]
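The staged flow above (extract from OLTP sources, transform in a staging area, load into the warehouse) can be sketched as a tiny pipeline. The in-memory SQLite warehouse, the sales table and the hard-coded source rows are illustrative assumptions, not any particular tool's behaviour.

```python
# A minimal extract -> transform -> load pipeline.
import sqlite3

def extract():
    # Stage I: pull raw rows from the OLTP source (hard-coded here for illustration)
    return [{"id": "001", "name": " alice ", "amount": "1,200"},
            {"id": "002", "name": "BOB",     "amount": "300"}]

def transform(rows):
    # Stage II: clean and standardize in the staging area
    return [(int(r["id"]), r["name"].strip().title(), float(r["amount"].replace(",", "")))
            for r in rows]

def load(rows):
    # Stage III: write only clean data into the warehouse
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, name TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn

warehouse = load(transform(extract()))
print(warehouse.execute("SELECT * FROM sales").fetchall())
```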
The ETL Process
Data Extraction - Simplified
ETL Tools - Classification
 First-generation: code-generation products that generate source code
 Second-generation: engine-driven products that generate directly executable code
ETL Tools - Classification
 Due to their more efficient architecture, second-generation tools have a significant advantage over first-generation tools
ETL Tools - First-Generation
 Strengths
 Tools are mature
 Programmers are familiar
with code generation in
COBOL or C
 Limitations
 High cost of products
 Complex training
 Extract programs have to be compiled from source
 Many transformations have to be coded manually
 Lack of parallel execution support
 Most metadata has to be generated manually
Characterized by the Generation and Deployment of Multiple Codes
ETL Tools - Second-Generation
 Extraction/Transformation/Load runs on server
 Data directly extracted from source and processed on server
 Data is transformed in memory and written directly to the warehouse database; high throughput since intermediate files are not used
 Directly executable code
 Support for monitoring, scheduling, extraction, scrubbing,
transformation, load, index, aggregation, metadata
Characterized by the Transformation Engine
ETL Tools - Second-Generation
PowerCenter/PowerMart from Informatica
Data Mart Solution from Sagent Technology
DataStage from Ascential
ETL Tools - Selection
Support to retrieve, cleanse, transform,
summarize, aggregate, and load data
Engine-driven products for fast, parallel operation
Generate and manage central metadata repository
Open metadata exchange architecture
Provide end-users with access to metadata in
business terms
Support development of logical and physical data
models
Data Loading - First Time Loads
First load is a complex
exercise
Data extracted from
tapes, files, archives etc.
First time load might take
several days to complete
Extract, Clean, Transform etc.
Source: www.survey.com
ETL Trends
DWH market is growing at 40-45% p.a.
Meta data management is shaping the market
Real time CRM requires real time DWH
E-comm and E-business are fuelling DWH & BI
ERP Data Warehousing is in demand
Major Trends
Source: Cutter Report - May 2000
ETL Trends
ETL technology built into other BI products
XML enabled platform independent data traffic
Near Real Time Data Warehouses using
middleware
Vendors have evolved their products into data
mart/analytical platforms
The Data Mart Strategy
 The most common approach
 Begins with a single mart and architected marts are added
over time for more subject areas
 Relatively inexpensive and easy to implement
 Can be used as a proof of concept for data warehousing
 Can perpetuate the “silos of information” problem
 Can postpone difficult decisions and activities
 Requires an overall integration plan
Data Sources and Types
Primarily from legacy, operational systems
Almost exclusively numerical data at the present
time
External data may be included, often purchased
from third-party sources
Technology exists for storing unstructured data, and this is expected to become more important over time
Extraction, Transformation,
and Loading (ETL) Processes
The “plumbing” work of data warehousing
Data are moved from source to target databases
A very costly, time-consuming part of data warehousing

Recent Development:
More Frequent Updates
Updates can be done in bulk and trickle modes
Business requirements, such as trading partner access to a Web site, require current data
For international firms, there is no good time to
load the warehouse
Recent Development:
Clickstream Data
Results from clicks at web sites
A dialog manager handles user interactions; an ODS (operational data store in the data staging area) helps to custom-tailor the dialog
The clickstream data is filtered and parsed and
sent to a data warehouse where it is analyzed
Software is available to analyze the clickstream
data
Data Extraction
Often performed by COBOL routines
(not recommended because of high program
maintenance and no automatically generated meta
data)
Sometimes source data is copied to the target database using the replication capabilities of a standard RDBMS (not recommended because of “dirty data” in the source systems)
Increasingly performed by specialized ETL software
Sample ETL Tools
Teradata Warehouse Builder from Teradata
DataStage from Ascential Software
SAS System from SAS Institute
Power Mart/Power Center from Informatica
Sagent Solution from Sagent Software
Hummingbird Genio Suite from Hummingbird
Communications
Reasons for “Dirty” Data
 Dummy Values
 Absence of Data
 Multipurpose Fields
 Cryptic Data
 Contradicting Data
 Inappropriate Use of Address Lines
 Violation of Business Rules
 Reused Primary Keys
 Non-Unique Identifiers
 Data Integration Problems
Data Cleansing
Source systems contain “dirty data” that must be
cleansed
ETL software contains rudimentary data cleansing
capabilities
Specialized data cleansing software is often used; it is important for performing name and address correction and householding functions
Leading data cleansing vendors include Vality
(Integrity), Harte-Hanks (Trillium), and Firstlogic
(i.d.Centric)

Steps in Data Cleansing
 Parsing
 Correcting
 Standardizing
 Matching
 Consolidating

Parsing
Parsing locates and identifies individual data
elements in the source files and then isolates
these data elements in the target files.
Examples include parsing the first, middle, and
last name; street number and street name; and
city and state.
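A simplified parsing sketch; the patterns below handle only the easy cases and are illustrative, not production-grade name and address parsing.

```python
# Isolate individual elements of a name and of a street address line.
import re

def parse_name(full_name):
    parts = full_name.split()
    middle = " ".join(parts[1:-1]) or None
    return {"first": parts[0], "middle": middle, "last": parts[-1]}

def parse_street(line):
    m = re.match(r"\s*(\d+)\s+(.+)", line)
    return {"street_number": m.group(1), "street_name": m.group(2)} if m else {"raw": line}

print(parse_name("John Q Public"))          # {'first': 'John', 'middle': 'Q', 'last': 'Public'}
print(parse_street("742 Evergreen Terrace"))
```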


Correcting
Correcting fixes parsed individual data components using sophisticated data algorithms and secondary data sources.
Examples include replacing a vanity address and adding a zip code.
Standardizing
Standardizing applies conversion routines to
transform data into its preferred (and consistent)
format using both standard and custom business
rules.
Examples include adding a prename, replacing a nickname, and using a preferred street name.
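A standardizing sketch; the nickname and street-suffix lookup tables are tiny illustrative stand-ins for the reference data a real tool would use.

```python
# Convert parsed values into a preferred, consistent format.
NICKNAMES = {"bob": "Robert", "bill": "William", "liz": "Elizabeth"}
STREET_SUFFIXES = {"st": "Street", "ave": "Avenue", "rd": "Road"}

def standardize_first_name(name):
    return NICKNAMES.get(name.lower(), name.title())

def standardize_street(street):
    *body, suffix = street.split()
    suffix = STREET_SUFFIXES.get(suffix.lower().rstrip("."), suffix.title())
    return " ".join(word.title() for word in body + [suffix])

print(standardize_first_name("bob"))      # Robert
print(standardize_street("main st."))     # Main Street
```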
Matching
Matching searches for and matches records within and across the parsed, corrected and standardized data, based on predefined business rules, to eliminate duplications.
Examples include identifying similar names and
addresses.
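A matching sketch that scores record similarity with difflib from the standard library; the 0.85 threshold and the sample records are arbitrary illustrations, not any vendor's matching algorithm.

```python
# Flag likely duplicates within standardized records using a simple similarity score.
from difflib import SequenceMatcher
from itertools import combinations

records = [
    {"id": 1, "name": "William Clinton", "street": "1600 Pennsylvania Avenue"},
    {"id": 2, "name": "Wiliam Clinton",  "street": "1600 Pennsylvania Ave"},
    {"id": 3, "name": "George Bush",     "street": "1600 Pennsylvania Avenue"},
]

def similarity(a, b):
    key = lambda r: (r["name"] + " " + r["street"]).lower()
    return SequenceMatcher(None, key(a), key(b)).ratio()

matches = [(a["id"], b["id"], round(similarity(a, b), 2))
           for a, b in combinations(records, 2) if similarity(a, b) > 0.85]
print(matches)   # expected to pair records 1 and 2 only
```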
Consolidating
 Analyzing and identifying relationships between
matched records and consolidating/merging them
into ONE representation.

Data Staging
 Often used as an interim step between data extraction and
later steps
 Accumulates data from asynchronous sources using
native interfaces, flat files, FTP sessions, or other
processes
 At a predefined cutoff time, data in the staging file is
transformed and loaded to the warehouse
 There is usually no end user access to the staging file
 An operational data store may be used for data staging

Data Transformation
Transforms the data in accordance with the
business rules and standards that have been
established
Examples include format changes, deduplication,
splitting up fields, replacement of codes, derived
values, and aggregates
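A transformation sketch covering the examples above: a date format change, splitting a name field, replacing old rating codes, deriving a total, and deduplicating. The column names, date format and code table are illustrative.

```python
# Apply business-rule transformations to raw rows, then drop exact duplicates.
from datetime import datetime

CODE_MAP = {"1": "A", "2": "B", "3": "C"}       # old rating codes -> new codes

raw = [
    {"name": "Doe, Jane", "order_date": "25/12/2004", "rating": "2", "qty": 3, "unit_price": 10.0},
    {"name": "Doe, Jane", "order_date": "25/12/2004", "rating": "2", "qty": 3, "unit_price": 10.0},
]

def transform(r):
    last, first = [p.strip() for p in r["name"].split(",")]                 # split up a field
    return {
        "first_name": first, "last_name": last,
        "order_date": datetime.strptime(r["order_date"], "%d/%m/%Y").date().isoformat(),  # format change
        "rating": CODE_MAP[r["rating"]],                                    # replacement of codes
        "total": r["qty"] * r["unit_price"],                                # derived value
    }

seen, clean = set(), []
for row in map(transform, raw):
    key = tuple(sorted(row.items()))                                        # deduplication
    if key not in seen:
        seen.add(key)
        clean.append(row)

print(clean)
```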
Data Loading
Data are physically moved to the data warehouse
The loading takes place within a “load window”
The trend is to near real time updates of the data
warehouse as the warehouse is increasingly used
for operational applications
Meta Data
Data about data
Needed by both information technology personnel
and users
IT personnel need to know data sources and
targets; database, table and column names;
refresh schedules; data usage measures; etc.
Users need to know entity/attribute definitions;
reports/query tools available; report distribution
information; help desk contact information, etc.
Recent Development:
Meta Data Integration
A growing realization that meta data is critical to
data warehousing success
Progress is being made on getting vendors to
agree on standards and to incorporate the sharing
of meta data among their tools
Vendors like Microsoft, Computer Associates, and
Oracle have entered the meta data marketplace
with significant product offerings
That’s a lot of ETL.
Let us come back to basics and revise.