DATA WAREHOUSE & QUALITY ISSUES

Information Search and Analysis Skills
Venue : NIIT Ltd, Agra.

Date: 15 Dec 2008 Semester: 4

Credits: Amol Shrivastav, Mohit Bhaduria, Harsha Rajwanshi

Guidance & support

Gunjan Verma

Contents

Introduction
Measuring Data Quality
Tools for Data Quality
Data Quality Methodology
ETL

Section 1
By Amol Shrivastav

A producer wants to know...

Which are our lowest/highest margin customers?
What is the most effective distribution channel?
Who are my customers and what products are they buying?
What product promotions have the biggest impact on revenue?
What impact will new products/services have on revenue and margins?
Which customers are most likely to go to the competition?

Data, data everywhere, yet...

I can’t find the data I need
    data is scattered over the network
    many versions, subtle differences

I can’t get the data I need
    need an expert to get the data

I can’t understand the data I found
    available data is poorly documented

I can’t use the data I found
    results are unexpected
    data needs to be transformed from one form to another

What is a Data Warehouse?
A single, complete and consistent store of data obtained from a variety of different sources, made available to end users in a way they can understand and use in a business context. [Barry Devlin]

Data Flow

Section 2
By Mohit Bhaduria

Measuring Data Quality

Attributes for measuring Data Quality

Interpretability: do I know what the fields mean? Do I know when the data I’m using was last updated?

Usefulness: is the data relevant for my needs? Is the data current?

Believability: am I missing too much data? Are there strong biases? Is the data quality consistent?

Accessibility: do the people who need access to the data have the proper access? Is the system crashing or too slow?
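Some of these questions can be turned into simple measurements. The sketch below is only an illustration: the record layout, the `city` field and the two function names are assumptions made for the example, not part of the presentation. It computes a completeness ratio (am I missing too much data?) and the age of the data in days (is the data current?).

```python
from datetime import date

def completeness(records, field):
    """Fraction of records that have a non-empty value for `field`."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)

def days_since_update(last_updated, today):
    """Age of the data in days, a rough indicator of how current it is."""
    return (today - last_updated).days

# Hypothetical sample data: two of the four customers have no city.
customers = [
    {"id": 1, "city": "Agra"},
    {"id": 2, "city": ""},
    {"id": 3, "city": None},
    {"id": 4, "city": "Delhi"},
]

city_completeness = completeness(customers, "city")
data_age = days_since_update(date(2008, 12, 1), date(2008, 12, 15))
```

What counts as an acceptable completeness ratio or data age depends on the business use, so any threshold would have to be agreed with the users of the warehouse.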

Linking Quality Factors to DW quality
[Figure: data warehouse quality is broken down into five factors (accessibility, interpretability, usefulness, believability, validation), each linked to the parts of the warehouse it depends on: data sources, DW design (models, language), query processing, update policy, DW evolution, DW data and the DW process.]

Quality metamodel
The quality metamodel can be used for both design and analysis purposes. The DWQ quality metamodel is based on the Goal-Question-Metric approach.

DWQM is a continuous process throughout the life of the DW.
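The Goal-Question-Metric idea can be pictured as a small tree: a quality goal is refined into questions, and each question is answered by one or more metrics. The sketch below is a hypothetical illustration of that structure only. The class names, metric values and thresholds are invented for the example and do not come from the DWQ metamodel itself.

```python
from dataclasses import dataclass, field

@dataclass
class Metric:
    name: str
    value: float       # measured value (made up for the example)
    threshold: float   # minimum acceptable value (made up for the example)

    def ok(self):
        return self.value >= self.threshold

@dataclass
class Question:
    text: str
    metrics: list = field(default_factory=list)

@dataclass
class Goal:
    text: str
    questions: list = field(default_factory=list)

    def failing_metrics(self):
        """Names of all metrics under this goal that miss their threshold."""
        return [m.name for q in self.questions for m in q.metrics if not m.ok()]

goal = Goal(
    "Improve believability of customer data",
    [
        Question("Am I missing too much data?",
                 [Metric("completeness", 0.82, 0.95)]),
        Question("Is the data quality consistent?",
                 [Metric("consistency", 0.99, 0.90)]),
    ],
)

problems = goal.failing_metrics()
```

Re-evaluating such a tree after every load is one way to treat quality management as a continuous activity rather than a one-off check.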

Section 3
By Harsha Rajwanshi

Tools for Data Warehouse Quality

Tools for Data quality

The tools that may be used to extract/transform/clean the source data or to measure/control the quality of the inserted data can be grouped in the following categories:
Data auditing tools
Data cleansing tools
Data migration tools
Data quality analysis tools


Data auditing tools enhance the accuracy and correctness of the data at the source. Data cleansing tools are used in the intermediate staging area. The data cleansing tools contain features which perform the following functions:
Data parsing (elementising)
Data standardization
Data correction and verification
Record matching
Data transformation
Householding
Documenting

Data migration tools are responsible for converting data from one platform to another. The SQL*Loader module of Oracle, Carleton’s Pure Integrate (formerly known as Enterprise/Integrator), ETI Data Cleanse, the EDD Data Cleanser tool and Integrity from Vality can be used to apply the rules that govern data cleaning, typical integration tasks, etc.

Data Quality Methodology


Profiling and assessment
Cleansing
Data integration/consolidation
Data augmentation

Profiling & Assessment

There are many different techniques and processes for data profiling. They can be grouped into three major categories:

Pattern Analysis – expected patterns, pattern distribution, pattern frequency and drill-down analysis
Column Analysis – cardinality, null values, ranges, minimum/maximum values, frequency distribution and various statistics
Domain Analysis – expected or accepted data values and ranges
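The three kinds of profiling can be sketched in a few lines of Python. This is an illustration under assumptions rather than how any particular profiling tool works: the function names, the `9`/`A` pattern notation and the sample values are all made up for the example.

```python
from collections import Counter

def column_profile(values):
    """Column analysis: count, nulls, cardinality, min/max and frequencies."""
    present = [v for v in values if v is not None]
    return {
        "count": len(values),
        "nulls": len(values) - len(present),
        "cardinality": len(set(present)),
        "min": min(present) if present else None,
        "max": max(present) if present else None,
        "frequencies": Counter(present),
    }

def pattern_of(text):
    """Pattern analysis helper: digits become 9, letters become A."""
    out = []
    for ch in text:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)
    return "".join(out)

def pattern_frequencies(values):
    """How often each character pattern occurs in a column."""
    return Counter(pattern_of(v) for v in values if v is not None)

def domain_violations(values, allowed):
    """Domain analysis: values that fall outside the accepted set."""
    return [v for v in values if v is not None and v not in allowed]

# Hypothetical sample columns.
ages = [34, 27, None, 34, 51]
phones = ["0562-2521234", "0562-2527890", "2521234", None]
genders = ["M", "F", "X", None]

age_profile = column_profile(ages)
phone_patterns = pattern_frequencies(phones)
bad_genders = domain_violations(genders, {"M", "F"})
```

A rare pattern or an out-of-domain value is only a candidate for review. Whether it is really an error is decided in the cleansing step.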

Cleansing

Data cleansing focuses on three main categories: business rule creation, standardizing, and parsing.
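A minimal sketch of the three categories, assuming a customer record with a free-text name and a phone number. The title list, the field names and the 10-or-11-digit rule are hypothetical and only serve to show where parsing, standardizing and a business rule each fit.

```python
import re

TITLES = {"mr", "mrs", "ms", "dr"}  # assumed list of recognised titles

def parse_name(raw):
    """Parsing: split a free-text name into title, first and last name."""
    tokens = [t.strip(".,") for t in raw.split()]
    tokens = [t for t in tokens if t]
    title = ""
    if tokens and tokens[0].lower() in TITLES:
        title = tokens.pop(0).capitalize()
    first = tokens[0].capitalize() if tokens else ""
    last = tokens[-1].capitalize() if len(tokens) > 1 else ""
    return {"title": title, "first": first, "last": last}

def standardize_phone(raw):
    """Standardizing: keep digits only, so every phone has the same format."""
    return re.sub(r"\D", "", raw)

def passes_phone_rule(phone):
    """Business rule (hypothetical): a phone must have 10 or 11 digits."""
    return len(phone) in (10, 11)

name = parse_name("dr. GUNJAN  verma")
phone = standardize_phone("(0562) 252-1234")
phone_ok = passes_phone_rule(phone)
```

Running the rule after standardizing matters here: the raw string contains brackets and spaces, so a length check on it would say nothing about the number itself.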

Data Integration and consolidation

The data can also be linked implicitly by defining join criteria on similar values, using a generated unique value or match codes based on fuzzy logic algorithms.
Determine what process to follow to consolidate, combine, or remove redundant data.
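The idea of a match code can be sketched as follows. The encoding used here (first letter plus the next consonants, cut to four characters) is a made-up simplification, and the standard-library `difflib.SequenceMatcher` ratio stands in for a real fuzzy logic algorithm. Commercial tools use their own, more elaborate schemes.

```python
from difflib import SequenceMatcher

def squeeze(text):
    """Keep the first letter, then consonants, dropping repeated letters."""
    letters = [c for c in text.upper() if c.isalpha()]
    if not letters:
        return ""
    kept = [letters[0]]
    for c in letters[1:]:
        if c in "AEIOU" or c == kept[-1]:
            continue
        kept.append(c)
    return "".join(kept)[:4]

def match_code(name, city):
    """Generated match code used as an implicit join key (hypothetical)."""
    return squeeze(name) + "-" + squeeze(city)

def similarity(a, b):
    """Similarity between 0 and 1, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Two spellings of the same customer coming from different source systems.
code_a = match_code("Mohit Bhaduria", "Agra")
code_b = match_code("Mohit Bhadauria", "AGRA")
score = similarity("Mohit Bhaduria", "Mohit Bhadauria")
```

Records that share a match code are candidates for consolidation. A similarity score above an agreed threshold can then be used to confirm the match before redundant rows are merged or removed.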

Extraction, Transformation & Loading in DW

Capture = extract…obtaining a snapshot of a chosen subset of the source data for loading into the data warehouse

Static extract = capturing a snapshot of the source data at a point in time

Incremental extract = capturing changes that have occurred since the last static extract
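The difference between the two kinds of extract can be sketched as below. The example assumes that every source row carries a `modified_at` timestamp. That column, the function names and the sample orders are hypothetical, and real systems may capture changes from logs or triggers instead.

```python
from datetime import datetime

def static_extract(source_rows):
    """Snapshot of every source row at the moment of the call."""
    return [dict(row) for row in source_rows]

def incremental_extract(source_rows, last_extract_time):
    """Only the rows changed after the previous extract.

    Relies on the assumed `modified_at` column of each row.
    """
    return [dict(row) for row in source_rows
            if row["modified_at"] > last_extract_time]

# Hypothetical source table.
orders = [
    {"id": 1, "amount": 250, "modified_at": datetime(2008, 12, 1, 9, 0)},
    {"id": 2, "amount": 400, "modified_at": datetime(2008, 12, 10, 14, 30)},
    {"id": 3, "amount": 120, "modified_at": datetime(2008, 12, 14, 8, 15)},
]

snapshot = static_extract(orders)
changes = incremental_extract(orders, datetime(2008, 12, 5))
```

The static extract copies all three orders, while the incremental extract only picks up the two orders modified after 5 December.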

Scrub = cleanse…uses pattern recognition and other techniques to upgrade data quality

Fixing errors: misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies

Also: decoding, reformatting, time stamping, conversion, key generation, merging, error detection/logging, locating missing data
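Two of the fixes listed above, erroneous dates and duplicate data, can be sketched with the standard library. The accepted date formats, the `customer` and `order_date` field names and the sample rows are assumptions for the example. Rejected rows are collected for error logging rather than silently dropped.

```python
from datetime import datetime

DATE_FORMATS = ("%d/%m/%Y", "%d-%m-%Y", "%Y-%m-%d")  # assumed input formats

def clean_date(raw):
    """Reformat a date to ISO form, or return None if it is not a valid date."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None

def scrub(rows):
    """Return (cleaned rows, rejected rows) after date fixing and de-duplication."""
    cleaned, errors, seen = [], [], set()
    for row in rows:
        iso = clean_date(row["order_date"])
        if iso is None:
            errors.append(row)   # erroneous date: log it, do not load it
            continue
        key = (row["customer"].strip().lower(), iso)
        if key in seen:          # duplicate of a row already kept
            continue
        seen.add(key)
        cleaned.append({"customer": row["customer"].strip().title(),
                        "order_date": iso})
    return cleaned, errors

# Hypothetical input: one duplicate and one impossible date.
raw_rows = [
    {"customer": " amol ", "order_date": "15/12/2008"},
    {"customer": "Amol", "order_date": "2008-12-15"},
    {"customer": "Mohit", "order_date": "31/02/2008"},
]

cleaned_rows, rejected_rows = scrub(raw_rows)
```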

Load/Index = place transformed data into the warehouse and create indexes

Refresh mode: bulk rewriting of target data at periodic intervals

Update mode: only changes in source data are written to data warehouse
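The two load modes can be contrasted with a small sketch in which the warehouse table is modelled as a dictionary keyed by `id`. The dictionary, the key name and the sample rows are assumptions for illustration. A real warehouse would use bulk loaders or SQL statements instead.

```python
def refresh_load(target, source_rows):
    """Refresh mode: discard the target and rewrite it in bulk from the source."""
    target.clear()
    for row in source_rows:
        target[row["id"]] = dict(row)

def update_load(target, changed_rows):
    """Update mode: write only the changed or new rows into the target."""
    for row in changed_rows:
        target[row["id"]] = dict(row)

# Hypothetical warehouse table, initially holding a stale row.
warehouse = {99: {"id": 99, "amount": 0}}

# Periodic refresh: the stale row 99 disappears.
refresh_load(warehouse, [{"id": 1, "amount": 250}, {"id": 2, "amount": 400}])
after_refresh = sorted(warehouse)

# Later update: row 2 changed, row 3 is new, row 1 is left untouched.
update_load(warehouse, [{"id": 2, "amount": 450}, {"id": 3, "amount": 120}])
after_update = sorted(warehouse)
```

Refresh mode is simpler and self-correcting but rewrites everything. Update mode moves less data but depends on the incremental extract identifying every change.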

Queries

Thank You!!!