
Business Intelligence

Data Quality / Information Quality for Northwind

COLOGNE UNIVERSITY OF APPLIED SCIENCES


February 15, 2015
Mohammad Aminul Islam (MatrNo: 11103812)

List of Contents
1. Introduction
2. Data & Information
3. What is Data/Information Quality?
4. Dimensions of Data Quality
5. Different Approaches for Data Quality
5.1. Data Profiling
5.2. Cleaning and Conforming
6. Data Quality Analysis on the Northwind Database
7. Summary
8. Conclusion
9. References


1 Introduction
Most business owners now prefer to use a data warehouse for their business. A data
warehouse is a convenient way to manage the business as a whole: it lets them observe the
overall condition of the business, supports decision making, and makes it easier to predict the
future of the business. Management is usually not interested in looking at every activity;
they are more interested in reports or summaries of the business. These reports or summaries
are different calculations over the data in the database. The data in the database must
therefore be of good quality, because it directly affects business decisions; otherwise many
problems arise, such as more user complaints or a wrong business direction. The Kimball
book discusses three important reasons why executives are concerned about data quality.
First, "if I could see the data, then I could manage my business better." Second, most data
sources are distributed, so integrating disparate data sources is required. Third, a sharp
increase in complaints indicates a lack of qualified data.

2 Data and Information

Data is raw, unorganized facts that need to be processed. Data can be simple and
seemingly random and useless until it is organized.1

Example: Each student's marks are data.

When data is processed, organized, structured or presented in a given context so as to make it
useful, it is called information.

Example: The average marks of all students are information.

1 See (Data Vs Information)

3 What is Data/Information Quality?

In computing, data quality means the reliability and application efficiency of data,
particularly when kept in a data warehouse. Data quality assurance (DQA) is the process of
verifying the reliability and efficiency of data.2

Data quality is an essential characteristic that determines the reliability of data for
making decisions.3

Data are of high quality if "they are fit for their intended uses in operations, decision
making and planning" (J. M. Juran). Alternatively, data are deemed of high quality if they
correctly represent the real-world construct to which they refer.

From these different definitions of data quality we can say that data should be reliable,
should represent the real world, and should serve the purpose of decision making.

2 See (Rouse, M. 2015)

3 See (IBM)


4 Data Quality Dimensions

Data quality dimensions concern accuracy, availability, completeness, conformance,
consistency, credibility, processability, relevance and timeliness.4

Figure 1: Dimensions of data quality (accuracy, availability, completeness, conformance,
consistency, credibility, processability, relevance, timeliness)

Accuracy: The accuracy of data is the correct representation of the real-world object,
situation or event.

Example: A wrong employee name may be a typing mistake.

Availability: The availability of data means that the data stays accessible over a long time
without any problem.

Example: Suppose our source data comes from a URL. When the URL is not available, it will
show a 404 Not Found error.

Completeness: The completeness of data means that the data items or data points necessary
to support the intended application are present.

Example: The full name of the customer, the address, etc.

Conformance: The conformance of data means a set of rules or regulations for capturing and
describing the data.

Example: A standard date format.

Consistency: Consistency means the data cannot violate the rules of its own database.

Example: A character value cannot be inserted into an integer column of the database.

Credibility: The credibility of data means that the sources of the data are trusted.

Processability: Processable data means that the data given as input is understandable by the
machine or software.

Relevance: Relevant data means the data contains the information necessary to support the
application.

Timeliness: The current state of the data is available without unnecessary delays.

4 See (Danette McGilvray)


5 Different Approaches for Data Quality

5.1 Data Profiling

According to Kimball, data profiling is the technical analysis of data to describe its content,
consistency and structure.5

Data profiling plays two roles, strategic and tactical. When a data source is identified, a data
profiling assessment determines its suitability for the data warehouse and supports a
go/no-go decision.

Data profiling is a very critical stage when initiating any database that incorporates source
data from external systems. Allocating sufficient time and analysis to the data profiling
assessment gives the designer a better solution and reduces project risk by identifying
potential data issues early.

Best practices for data profiling6

Distinct count and percent: Analyzing the distinct values of each column helps to identify the
unique values within the source data. Identifying unique keys is a fundamental requirement
for database and ETL architecture; in particular, when we need to insert or update data in the
database, we need these unique keys to act on specific records.

Example columns profiled for distinct counts: Order ID, Order Date, Shipped Date, Ship Via,
Ship Name, Ship City, Ship Region, Ship Country; Customer ID, Customer Name, Address,
City, Region, Postal Code.

5 See (Kimball, R., Ross)

6 See (tdwi)
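The distinct count and percent check can be sketched in Python with SQLite. The table and column names below are illustrative stand-ins, not the actual Northwind schema:

```python
import sqlite3

# Build a tiny stand-in for an orders table (names are illustrative).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_id INTEGER, ship_city TEXT)")
con.executemany("INSERT INTO orders VALUES (?, ?)",
                [(1, "Cologne"), (2, "Dhaka"), (3, "Cologne"), (4, "Berlin")])

# Distinct count and percent for each profiled column.
profile = {}
for col in ("order_id", "ship_city"):
    total = con.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
    distinct = con.execute(
        f"SELECT COUNT(DISTINCT {col}) FROM orders").fetchone()[0]
    profile[col] = (distinct, round(100.0 * distinct / total, 1))

print(profile)  # order_id is 100% distinct, so it is a candidate unique key
```

A column whose distinct percent is 100% is a candidate key for inserts and updates.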


Zero, blank, null percent: Analyzing missing, blank and null values helps to identify
potential data issues. This information helps the database or ETL architect to set appropriate
default values, or to allow nulls in a target database column where data is unknown.

Field          Zero   Blank   Null   Percent
Order ID       0      0       0      100%
Order Date     500    200     30     30%
Shipped Date   50     20      40     40%
Ship Via       40     100     10     35%
Ship Name      20     400     30     22%
Ship City      20     400     12     25%
Ship Country   400    200     40     20%
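A zero/blank/null profile like the table above can be computed per column; the following Python/SQLite sketch uses hypothetical column names and data:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (ship_name TEXT, freight REAL)")
con.executemany("INSERT INTO orders VALUES (?, ?)",
                [("Alfki", 12.5), ("", 0.0), (None, 3.0), ("Bolid", None)])

def profile_column(col):
    """Count zero, blank and NULL occurrences and their combined percentage."""
    total = con.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
    nulls = con.execute(
        f"SELECT COUNT(*) FROM orders WHERE {col} IS NULL").fetchone()[0]
    blanks = con.execute(
        f"SELECT COUNT(*) FROM orders WHERE TRIM({col}) = ''").fetchone()[0]
    zeros = con.execute(
        f"SELECT COUNT(*) FROM orders WHERE {col} = 0").fetchone()[0]
    pct = round(100.0 * (nulls + blanks + zeros) / total)
    return zeros, blanks, nulls, pct

print(profile_column("ship_name"))
print(profile_column("freight"))
```

Columns with a high share of problem values are candidates for default values or NOT NULL rules in the target schema.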

Minimum, maximum string length and type: Analyzing the string lengths of the source data
helps to set the length and type of the target database columns. This is very important for
large databases: it saves space and increases query performance by minimizing table scan
time. If the field is part of an index, keeping the data type in check helps to minimize index
size, overhead and scan times.

Field          Minimum   Maximum   Type
Order ID       6         8         Integer
Order Date     10        16        Date
Shipped Date   10        16        Date
Ship Via       3         15        Varchar
Ship Name      2         14        Varchar
Ship City      4         15        Varchar
Ship Country   3         11        Varchar
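The minimum and maximum lengths can be read straight from the source with aggregate functions; a small Python/SQLite sketch with illustrative data:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (ship_city TEXT)")
con.executemany("INSERT INTO orders (ship_city) VALUES (?)",
                [("Bonn",), ("Cologne",), ("Rio de Janeiro",)])

# MIN/MAX of LENGTH() suggest a tight VARCHAR size for the target column.
lo, hi = con.execute(
    "SELECT MIN(LENGTH(ship_city)), MAX(LENGTH(ship_city)) FROM orders"
).fetchone()
print(lo, hi)
```

The target column could then be declared as VARCHAR close to the observed maximum, with some headroom for future values.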


Numerical and date range analysis: This analysis helps to characterize the numerical and
date values. Suppose we only need integer values; if we declare the column with a precision
instead, it takes more space than an integer. Likewise, after observing the date values we can
decide which format is appropriate for the database.

Field          Data 1        Data 2        Data 3        Observation
Order ID       123456        123457        123458
Order Date     01.01.2015    2015.01.02    03.01.2015    different formats
Shipped Date   03.01.2015    04.01.2015    05.01.2015
Ship Via       (blank)       Air           Bus
Ship Name      (blank)       XXX           XXX
Ship City      Dhaka         Kln           Cologne
Ship Country   Bangladesh    Germany       Deutschland   same meaning, but causes confusion
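Mixed date formats like the ones above can be detected by trying each known format in turn; this Python sketch assumes two candidate formats observed during profiling:

```python
from datetime import datetime

# Candidate formats observed during profiling (illustrative).
FORMATS = ("%d.%m.%Y", "%Y.%m.%d")

def parse_date(raw):
    """Try each known format; return a normalized ISO date or None."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            pass
    return None  # format not recognized: a data quality issue to report

print(parse_date("01.01.2015"))   # day-first source format
print(parse_date("2015.01.02"))   # year-first source format
print(parse_date("2015-01-02"))   # unknown format, returns None
```

Values that parse under no known format, or under more than one, should be flagged rather than silently loaded.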

Pattern analysis: Checking the pattern of the data confirms that a data field is formatted
correctly.

Field           Data 1              Data 2             Observation
Customer ID     123456              123457
Customer Name   Md. Aminul          Mohammad Islam     first name, last name
Address         Fuldaer str         Oranienstr
Mobile No       017564879954        +4914756214789     different formats
E-mail          Aminul@yahoo.com    holy@gmail.com
Website         www.aminul.com      Go.com             different formats
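Pattern checks are usually expressed as regular expressions; the patterns below are illustrative sketches, not production-grade validation rules:

```python
import re

# Illustrative patterns; real validation rules would be project-specific.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"(\+49|0)\d{6,14}")  # German numbers, two accepted prefixes

def check_pattern(pattern, value):
    """True if the whole value matches the expected pattern."""
    return bool(pattern.fullmatch(value))

print(check_pattern(EMAIL, "Aminul@yahoo.com"))   # valid e-mail shape
print(check_pattern(EMAIL, "Go.com"))             # no @, fails the pattern
print(check_pattern(PHONE, "017564879954"))       # national format
print(check_pattern(PHONE, "+4914756214789"))     # international format
```

If both phone formats must be accepted, a later conforming step would normalize them to one canonical representation.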


5.2 Cleaning and Conforming

According to Kimball, cleaning and conforming are the critical ETL system tasks. Extracting
and delivering simply move and load the data, but cleaning and conforming add value to the
data and enhance value to the organization (Kimball, Ralph, The Data Warehouse Lifecycle
Toolkit, second edition, page 330).

Kimball names nine practices that help to address data quality:

Declare a high-level commitment to a data quality culture.
Drive process re-engineering at the executive level.
Spend money to improve the data entry environment.
Spend money to improve application integration.
Spend money to change how processes work.
Promote end-to-end team awareness.
Promote interdepartmental cooperation.
Publicly celebrate data quality excellence.
Continuously measure and improve data quality.

Data cleansing system: The ETL data cleansing system fixes dirty data while the data
warehouse still provides an accurate picture of the data captured by the organization's
production systems. The goal is to develop an ETL system that is capable of correcting,
rejecting or loading data, with easy-to-use structures, rules and standardization.

Quality screens: Quality screens are the heart of the ETL system. A quality screen is a test
run against the data. If the test passes, nothing happens; but if the test catches wrong data,
the event is recorded in the error event schema. There are three types of quality screen tests:

Column screen test: This test happens within a single column, for example whether the
column contains wrong values or null values, or whether a value fails the required format.

Structure screen test: This test checks the relationships of data among columns, for example
primary keys, foreign keys, and one-to-many relationships between fields in two columns.

Business rule screen: This implements more complex tests that do not fit the column or
structure screens, for example shipment date < delivery date.
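A column screen can be sketched as a test function run over every row, with failures appended to an error-event list; all names and rules here are illustrative:

```python
from datetime import datetime

# Failed screens are appended here: a minimal stand-in for an error event
# schema that records what failed, where, and when.
error_events = []

def column_screen(rows, column, test, screen_name):
    """Run a single-column test on every row; log each failure."""
    for i, row in enumerate(rows):
        if not test(row[column]):
            error_events.append(
                {"screen": screen_name, "row": i, "column": column,
                 "logged_at": datetime.now().isoformat()})

orders = [{"ship_city": "Cologne"}, {"ship_city": None}, {"ship_city": ""}]
column_screen(orders, "ship_city", lambda v: bool(v), "ship_city not empty")
print(len(error_events))  # two rows fail the screen
```

Structure and business rule screens would follow the same shape, with tests that look at several columns or several tables at once.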


Error event schema: The error event schema is a centralized schema whose purpose is to
record every error that occurs in the database, with the date and time. By reviewing the
recorded errors it is possible to improve data quality.

Figure 2: Error event schema


Audit dimension assembler: The audit dimension is a special dimension of the ETL system
for each fact table. When each record is created, it adds metadata to the table. This metadata
is available to BI applications and gives visibility into data quality.


6 Data Quality Analysis on the Northwind Database

Address:
Full name: first name, middle name, last name
House no.
Country:
Region:
City:
Street:
Postal code:
Contact:
Optional: input person name

Every input should be specific:

The address should not be entered in one line.
Constant values for country, city, street and postal code.
The contact number should be in a valid format.
All fields are mandatory.
If the entry is made by a data entry operator, it should include his name or ID.

Order, shipment:
The order ID and shipment ID will be generated automatically.
When an order or shipment is created, the date and time will be added automatically.
Order date < shipment date.

SAS code for region and ZIP:

if region eq 'Not Specified' or trim(region) eq '' then region = '*';
if ZIP eq 'Not Specified' or trim(ZIP) eq '' then ZIP = '*';

Here, if the region or ZIP code is not specified, it is set to '*'.


Finding null values in the Leader column

Figure: Dim Employee dimension

SAS code:
proc sql;
select NMISS(Leader)
from West3.DIM_EMPLOYEE;
quit;

Results:


Replacing null values in the Leader column

SAS code:
proc sql;
select COALESCE(Leader, 0)
from West3.DIM_EMPLOYEE;
quit;

Results:

Here, the null values are replaced by 0 because the data type of the column is integer.


Checking unique values of Bestell_Nr

SAS code for the row count:
proc sql;
select count(Bestell_Nr)
from WEST3.BESTELLUNGEN;
quit;
Results:

SAS code for the distinct value count:
proc sql;
select count(distinct Bestell_Nr)
from WEST3.BESTELLUNGEN;
quit;
Results:

Here, the total row count is 832 and the distinct value count is 830. That means two values
are duplicated. Let's find the duplicate values.


SAS code for the duplicate values:

proc sql;
select Bestell_Nr from WEST3.BESTELLUNGEN
group by Bestell_Nr
having count(Bestell_Nr) > 1;
quit;
Result:

Bestell_Nr 10369 and 10830 are not unique values.
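The same GROUP BY / HAVING duplicate check can be sketched in Python with SQLite; the sample rows below are made up to reproduce the two duplicated keys:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE bestellungen (bestell_nr INTEGER)")
con.executemany("INSERT INTO bestellungen VALUES (?)",
                [(10369,), (10369,), (10500,), (10830,), (10830,)])

# GROUP BY ... HAVING COUNT(*) > 1 isolates the duplicated keys.
dups = [r[0] for r in con.execute(
    "SELECT bestell_nr FROM bestellungen "
    "GROUP BY bestell_nr HAVING COUNT(*) > 1 ORDER BY bestell_nr")]
print(dups)
```

Each duplicated key would then be inspected to decide whether to merge, correct or reject the offending rows.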


7 Summary

Distinct count and percent: find the unique values in a dimension.
Zero, blank, null percent: find zero, blank and null values and make rules to avoid them.
Minimum, maximum string length and type: find the string lengths and types and set the
proper string length and type.
Numerical and date range analysis: analyze whether numerical values need a fraction or not.
Pattern analysis: define patterns for capturing real data.
Quality screens: column screen tests, structure screen tests and business rule screens give us
more accurate data.

8 Conclusion
In the end, I want to say that data quality is a continual process. But if we analyze the data
before building the warehouse and apply these techniques, it is possible to minimize data
errors and increase data quality.


9 References
1. Data vs Information. Available from:
http://www.diffen.com/difference/Data_vs_Information. [Accessed: 20th December 2014]
2. Rouse, M. (2015) Data Quality. [Online] Available from:
http://searchdatamanagement.techtarget.com/definition/data-quality. [Accessed: 11th
February 2015]
3. IBM, Data Quality. http://www-01.ibm.com/software/data/quality/. [Accessed: 21st
December 2014]
4. Danette McGilvray, Granite Falls Consulting, Inc. Excerpted from Executing Data Quality
Projects. http://www.gfalls.com/storage/book/individual-downloads-quick-ref/10steps_DQDimen.pdf.
[Accessed: 25th December 2015]
5. Kimball, R., Ross, M., Thornthwaite, W., Mundy, J., & Becker, B. (2008). The Data
Warehouse Lifecycle Toolkit. 2nd Edition: Wiley Publishing.
6. tdwi (3 February 2010), The Necessity of Data Profiling.
http://tdwi.org/Articles/2010/02/03/Data-Profiling-Value.aspx?Page=1. [Accessed: 27th
December 2015]
7. Dekkers, M., Loutas, N., De Keyzer, M., & Goedertier, S. Open Data & Metadata Quality.
8. Geiger, J. G., Intelligent Solutions, Inc., Boulder. Data Quality Management: The Most
Critical Initiative You Can Implement.
9. Wikipedia: http://en.wikipedia.org/wiki/Data_quality
