
See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/335101436

A Framework for Data Quality in Data Warehousing

Chapter · August 2019


2 authors:

Rao Nemani, College of St. Scholastica
Ramesh Konda, International Technological University

All content following this page was uploaded by Rao Nemani on 10 August 2019.



A Framework for Data Quality in Data Warehousing

Rao R. Nemani, University of St. Thomas, Minneapolis, MN, USA, Nema8811@stthomas.edu
Ramesh Konda, University of Phoenix, Phoenix, AZ, USA, Konda1991@yahoo.com

Abstract

Despite the rapid growth in the development and use of Data Warehousing (DW) systems, the data quality (DQ) aspects are not well defined or understood. Organizations rely on the information extracted from their DW for their day-to-day as well as critical and strategic business operations. Creating and maintaining a high level of data quality is one of the key success factors for DW systems. This article examines the current practices of DQ, and proposes a research- and experience-based framework for DQ in DW, which describes an approach for defining, creating, and maintaining DQ management within a DW environment.

1. Introduction

In today’s global and competitive business environment, organizations rely on data and information to make timely decisions and to offer innovative services that meet the changing needs of their customers. Typically, organizations collect data through various systems and try to create coherent and aggregate data for decision-making purposes. Bryan (2002) states that the fast growth of electronic commerce in the last decade saw a large volume of business-critical and confidential data being exchanged online among companies and customers. At the same time, rapid advancements in storage technology made it possible for organizations to store the vast amounts of data collected at a relatively low cost. However, this data failed to provide knowledge because of its raw form. To solve this problem, organizations adopted a transformation process called a data warehouse, which is defined as a “collection of subject-oriented, integrated, non-volatile data that supports the management decision process”. Typically, a data warehouse contains five types of data: current detail data, older detail data, lightly summarized data, highly summarized data, and metadata. Nowadays, data warehousing has become an integral part of both business and Information Technology strategies to integrate heterogeneous information sources, and to enable On-Line Analytic Processing (OLAP) and Decision Support Systems (DSS).

There has been great progress and improvement in the core technology of DW; however, the DQ aspects are among the crucial issues that have not been adequately addressed. Ensuring high-level DQ is one of the most expensive and time-consuming tasks to perform in data warehousing projects. Many data warehouse projects have failed halfway through due to poor DQ. This is often because DQ problems do not become apparent until the project is underway. Any changes to the DW at the implementation stage cost a substantial amount that may push project budget limits. If all the considerations are examined thoroughly at the strategy and design stage of the DW, controls for DQ can be built into the design that can decrease operational costs, increase customer satisfaction, and improve decision-making and employee confidence in using the data (Andrea & Miriam, 2005). The quality of information systems (IS) is critically important for companies to derive a return on their investments in IS projects, and the DW is no different in that sense. Therefore, developing good-quality IS that meet user needs is becoming a critical theme for information technology management (Guimaraes, Staples & McKeen, 2007).

In this paper, the authors examine the current DQ practices in DW and propose a framework for improving the quality of data in a DW environment. Section 2 discusses the problem statement that this paper addresses. Section 3 gives a brief literature review. Section 4 presents a framework for DQ in DW. Section 5 describes a typical use case. The last section summarizes the paper.

2. Problem Statement

A review of the literature on issues with DW reveals that DQ is one of the most prominent. It is not known how and to what extent DQ issues affect DW implementation. Poor data quality increases operational costs, adversely affects customer satisfaction, and has serious ramifications for society at large. Some examples of poor DQ can be seen in Nord’s work (2005), which describes how billions of dollars have been lost due to fraud in food stamps, wherein some recipients received benefits even after they were long dead. Nord further states that organizations have experienced financial losses due to inaccurate and outdated information in data systems. Data quality problems range from minor, insignificant issues to major, problematic ones (Bielski, 2005).
3. Literature Review

The science of DQ is yet to be advanced to the point where standard measurement methods can be devised for any of these issues. Inappropriate, misunderstood, or ignored DQ has a negative effect on business decisions, performance, and the value of data warehouse systems. English (2001) argues that managing the quality of your information is as important as managing your business. English (2001) listed several examples that draw attention to the negative impact of DQ issues in DW. They include errors in students’ Basic Standards Test scores, pension withholdings, invoicing, and food processing that led to the loss of billions of dollars as well as the loss of reputation of those businesses.

In a survey by Friedman, Nelson, and Radcliffe (2004), 75 percent of respondents reported significant problems stemming from defective and fragmented data, over 50 percent had incurred costs for data reconciliation, and 33 percent had IT systems delayed owing to data quality problems. In another survey, Ambler (2006) reported several metrics from the responses indicating that data quality has been a major issue that requires considerable attention. For example, Figure 1 illustrates that only 2 percent of the respondents feel good about the data quality in their data warehousing, while the remaining 98 percent indicate some kind of data quality issue that needs to be addressed.

Figure 1. Current State of Data Quality (Ambler, 2006)

One of the major factors influencing DQ is user perception over time. If user assumptions or perceptions are left unchecked, they start to become ‘the truth’ whether or not they have an objective or factual basis, from both business and technical perspectives (Bryan, 2002). DQ indicates how well enterprise data matches up with the real world at any given time.

There are many sources of ‘dirty data’, which include a) poor data entry, which includes misspellings, typographical errors and transpositions, and variations in spelling or naming, b) data missing from database fields, c) lack of company-wide or industry-wide data coding standards, d) multiple databases scattered throughout different departments or organizations, with the data in each structured according to the rules of that particular database, and e) older systems that contain poorly documented or obsolete data (Andrea & Miriam, 2005). Nord (2005) mentioned that DQ has become an increasingly critical concern and has been rated as a top concern to data consumers in many organizations. Nord (2005) further stated that data quality is gaining importance in research and among consumer organizations.

The above discussion indicates that DQ issues are critical and need to be addressed in order to be successful with a DW environment.

4. A Framework for Data Quality in Data Warehousing

Iain and Don (2005) argue that in order to tackle this difficult issue, organizations need both a top-down approach to DQ sponsored by the most senior levels of management and a comprehensive bottom-up analysis of data sourcing, usage, and content, including an assessment of the enterprise’s capabilities in terms of data management, relevant tools, and people skills. Xu, Nord and Brown (2002) believe that for organizations considering implementing a DW, it is essential that DQ issues be thoroughly understood and that the organizations obtain knowledge of the critical success factors essential to ensure DQ during the implementation process. However, the crucial question of defining data quality is often ignored until late in the process. Jean-Pierre (2004) believes that this could be due to the lack of a solid methodology to deal with DQ. Quality is a relative statement and varies by individual based upon their perceptions. In simplistic terms, data quality is perceived as “true and accurate”. This makes DQ hard to define and measure. To understand how to tackle the problem, DQ needs to be understood thoroughly from the organization’s point of view, and then a process can be established to deal with DQ within the organization.

The main components of data that determine DQ are completeness, appropriateness, accuracy, grouping accuracy, access, confidence, currency, regulatory and legal compliance, and meta-linking. Data interface, data replication, and data migration and movement all share common characteristics such as volume of data, timeliness of movement and processing, and direction of flow between sources and targets (Bryan, 2002).
DQ tools generally fall into one of three categories: auditing, cleansing, and migration. Data auditing tools apply predefined business rules against a source database. These tools enhance the accuracy and correctness of the data at the source. Some data cleansing tools compare the data against an independent source, e.g. US Postal Codes, to verify the data. Data is typically moved from the source to an intermediate staging area where the data cleansing activities are performed. Data migration is an activity where data is extracted and transported from one source to another. Data migration tools perform the extraction, transportation, and mapping of data from one platform to another. Poor DQ impacts the typical enterprise in many ways, such as customer dissatisfaction, increased cost, and lowered employee job satisfaction. The slightest suspicion of poor DQ often hinders managers from reaching any decision. In order to ensure DQ assessment in DW, Hufford (1996) proposed a model that consists of defining DQ expectations and metrics, identifying and assessing risks, mitigating risks, and monitoring and evaluating results on an on-going basis.

Xu et al. (2002) agree with the notion that DQ means accurate, timely, complete, and consistent data. Xu et al. used the terms DQ and information quality synonymously. Data quality is influenced by a number of factors. Several studies have shown that the success of information systems and quality is dependent on Total Quality Management (TQM) and Just-in-Time (JIT) practices (Xu et al., 2002).

Using the TQM philosophy, data quality management can be described as a set of policies, procedures, and follow-up actions that occur over the complete life cycle of data, starting from data generation and the conversion of data into information through to the archival or discarding of data. It consists of two major components: the data content itself and the accompanying infrastructure (Vikram and Sreedhar, 2006). It is important that data in the data warehouse correctly reflects the real world, but it is also very important that data can be easily understood. Quality factors such as accessibility and timeliness, believability and understandability, and design and usage flexibility play a crucial role in the success of data warehousing.

Vikram & Sreedhar (2006) proposed a nine-step approach for the successful deployment of a DQ program for a DW initiative. The nine steps are: identifying data elements, defining data-quality measurement, instituting the audit measure, defining target metrics for each data attribute, deploying the monitoring program, finding gaps, automating the cleansing effort, developing procedures, and establishing a continuous monitoring program.

Nelson, Todd and Wixom (2005) developed a model consisting of nine fundamental determinants of quality in an information technology context, four under the rubric of information quality and five that describe system quality. Their model strikes a balance between comprehensiveness and parsimony in data warehouse environments.

Theodoratos and Bouzeghoub (2001) presented a framework with a high-level approach that allows checking whether a view selection guaranteeing a data completeness quality goal also satisfies a data currency quality goal. These authors thus used views to accomplish the data quality requirements in a DW environment.

Data warehousing depends on integrating data quality assurance into all warehousing phases: planning, implementation, and maintenance (Ballou & Tayi, 1999). Practitioners in quality control methodology recommend addressing the “root cause”, duly considering the following data quality factors:

1) Accuracy
2) Completeness
3) Timeliness
4) Integrity
5) Consistency
6) Conformity
7) Record Duplication

Based on the literature review and on our personal experience, we propose a comprehensive version of the Data Warehouse Development Life Cycle (DWDLC) Layers, which lists comprehensive phases and links the DQ factors as follows.

Figure 2. Data Warehouse Development Life Cycle (DWDLC) Layers
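As an illustration only, several of the data quality factors listed above can be expressed as simple ratios over a set of records. The following Python sketch is not part of the framework itself: the sample rows, the field names, the five-digit ZIP pattern, the reference date, and the 30-day freshness window are all assumptions chosen for the example, and it covers only completeness, conformity, timeliness, and record duplication.

```python
import re
from datetime import date

# Hypothetical sample rows; None marks a missing value.
ROWS = [
    {"id": 1, "zip": "55105", "email": "a@example.com", "updated": date(2008, 10, 1)},
    {"id": 2, "zip": "5510", "email": None, "updated": date(2008, 10, 20)},
    {"id": 2, "zip": "5510", "email": None, "updated": date(2008, 10, 20)},
    {"id": 3, "zip": "85001", "email": "c@example.com", "updated": date(2007, 1, 15)},
]


def completeness(rows, field):
    """Share of rows in which the field is populated."""
    return sum(r[field] is not None for r in rows) / len(rows)


def conformity(rows, field, pattern):
    """Share of populated values that fully match the expected format."""
    values = [r[field] for r in rows if r[field] is not None]
    if not values:
        return 1.0  # nothing populated, so nothing violates the format
    return sum(re.fullmatch(pattern, v) is not None for v in values) / len(values)


def timeliness(rows, field, as_of, max_age_days):
    """Share of rows refreshed within max_age_days of the as_of date."""
    return sum((as_of - r[field]).days <= max_age_days for r in rows) / len(rows)


def duplication(rows):
    """Share of rows that exactly repeat an earlier row."""
    seen = set()
    repeats = 0
    for r in rows:
        key = tuple(sorted(r.items()))  # hashable fingerprint of the row
        if key in seen:
            repeats += 1
        seen.add(key)
    return repeats / len(rows)
```

Calling, for instance, `completeness(ROWS, "email")` or `duplication(ROWS)` yields a value between 0 and 1 that could be tracked against a target agreed in the requirements phase. Accuracy, integrity, and consistency usually need a reference source or cross-table rules and are deliberately left out of this sketch.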
The major theme in each of the Data Warehouse Development Life Cycle (DWDLC) Layers presented above can be described as follows:

1) Planning: Beyond the success of the DQ effort itself, defining and managing the project scope clearly influences the project’s overall success. Every DW project requires a careful balance in which data sources, processes, procedures, and other factors are scoped commensurate with the project’s size, complexity, and importance.

2) Analysis: In this layer, one should consider analyzing the data from the various available data sources. In this phase it is recommended to perform data profiling of the data being considered for the DW project.

3) Requirements: In this layer, the DW professional will collaborate with the business stakeholders to understand the business problem and to define and document the required data quality factors for the DW project.

4) Develop: In this layer, the DW professional will develop and test the DW solution, keeping in mind the DQ factors defined in the requirements phase.

5) Implement: In this layer, the DQ solution will be implemented after it has been duly signed off by the quality assurance team.

6) Measure: In this layer, data sampling is done and a measure of current process capability is worked out on the DQ factors defined in the requirements phase. This activity helps to minimize data quality problems.

Additionally, in order to achieve the above phases, we propose a four-prong DQ management model for defining and ensuring data quality in a DW environment, as described below. It is believed that under each prong, many relevant tasks need to be defined to achieve the required DQ. Also, appropriate metrics should be developed to measure how effectively the tasks defined under each prong are implemented.

Figure 3. Data Quality in Data Warehousing is a four-pronged approach

Each of the prongs in Figure 3 defines a specific functional area. As the arrows in the figure indicate, quality in each of these functional areas can be assured via on-going audits and continuous improvement efforts. Each of the prongs is defined as follows in a broader sense.

Basics – The very first step in any IT system is to ensure data consistency and completeness. In this case, one may examine individual systems and data sources to ensure the data is complete, in the sense that there is no missing data in the fields and that the fields hold valid data values. This can be achieved using triggers and constraints from the database point of view. One could also check and monitor the extract-transform-load (ETL) work flows for time to load, stalled and suspended jobs due to soft and hard errors, and completeness of all data fields. Additionally, redundancy in the data can be used as one of the metrics, with the aim of minimizing it.

Truth – Correctness of the data is the other side of the coin from completeness of the data. One of the major strengths of a DW is that it is a single point of truth. Using the data from their DW, organizations drive day-to-day as well as long-term strategic business activity. As a part of the quality checks and strategy, one should consider use cases and historical sample data in order to evaluate the correctness of the data. This should be an on-going activity. Additionally, establishing a data governance function that oversees the data fields’ definitions and sample values is key to ensuring data quality. The governance function would also help in standardizing the data and the verification process across the enterprise. Data governance guidelines can also be helpful in developing a process for adding or removing data fields or tables. More specifically, when adding a new field, the definition of the field could be verified against the existing metadata to prevent redundancy.
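One way such a metadata check might look is sketched below in Python. It is an illustration under stated assumptions, not a prescribed implementation: the catalog contents, the function names, and the 0.8 similarity threshold are hypothetical, and plain string similarity from the standard library's difflib stands in for whatever matching rule a governance team would actually adopt.

```python
from difflib import SequenceMatcher

# Hypothetical metadata catalog: field name -> business definition.
CATALOG = {
    "customer_id": "unique identifier assigned to a customer",
    "order_date": "calendar date on which the order was placed",
    "cust_email": "primary email address of the customer",
}


def _normalize(text):
    """Lowercase and unify separators so 'Cust-Email' matches 'cust_email'."""
    return " ".join(text.lower().replace("_", " ").replace("-", " ").split())


def similarity(a, b):
    """Similarity ratio in [0, 1]; 1.0 means identical after normalization."""
    return SequenceMatcher(None, _normalize(a), _normalize(b)).ratio()


def find_redundant_fields(name, definition, catalog, threshold=0.8):
    """Return existing fields whose name or definition closely matches the proposal."""
    matches = []
    for existing_name, existing_def in catalog.items():
        score = max(
            similarity(name, existing_name),
            similarity(definition, existing_def),
        )
        if score >= threshold:
            matches.append((existing_name, round(score, 2)))
    # Strongest candidates first, so a reviewer sees the likeliest duplicate on top.
    return sorted(matches, key=lambda m: m[1], reverse=True)
```

A proposed field such as `"Cust-Email"` would be flagged against the existing `cust_email` entry, while an unrelated proposal would return an empty list. The flagged candidates are meant for human review by the governance function, not for automatic rejection.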

Coherent – The main premise of a DW is merging data from disparate sources. In this process, it is critical to build coherent data using dimension keys. Ensuring data coherency is critical for OLAP analysis as well as for building aggregates. Use cases can be critical for verifying data coherency. Missing values for foreign keys and/or joining keys are a major issue that falls under this criterion and needs to be monitored.

Audit and Continuous Improvement – The Plan, Do, Check, and Act (PDCA) process can be used in this stage. As an independent program, a frequent and automated audit of data completeness, accuracy, and coherency will be critical in finding the gaps. Typically, a DW contains a huge volume of data, which is very time consuming to audit by manual methods. It is strongly recommended to develop customized and automated programs to audit and monitor as much as possible. Raising triggers when something unusual is seen in the data will be an important part of the audit. Swiftly acting on the triggers and finding the root causes and appropriate fixes would lead to continuous improvement of the DW’s ability to be a beneficial system for the organization.

5. Use Case

The Data Warehouse Development Life Cycle (DWDLC) Layers described above, in combination with the four-pronged Data Quality model, can be used to address most data quality issues. The following use case illustrates how the model can be applied.

Consider this scenario: Five functional business managers, each representing a different business function, walk into an important strategic business planning meeting. Everyone is carrying comprehensive reports about their business function’s performance. Each manager is prepared to make some strategic suggestions based on the reports in hand. In less than an hour, they all recognize that their reports reflect entirely different numbers, because the reports are not compiled from a common set of data; no one is sure which, if any, set of numbers is accurate enough to consider for the strategic planning. This results in postponing both the important decision and the launch of crucial initiatives.

The above scenario represents one of the issues faced by organizations across the globe. As it indicates, the inconsistency of data and other quality issues in the data from which Business Intelligence (BI) reports are generated are some of the major concerns. Using the proposed DWDLC model, the above scenario can be addressed in each layer as follows:

1) Planning: As discussed before, in this layer we have to scope the various functions of the business units. These five functional managers need to plan and scope what reports they are interested in getting from the DW.

2) Analysis: In this layer, these five functional managers should consider analyzing the data from the various available data sources. Also, their technical and business representatives will perform data profiling of the data being considered for the DW project.

3) Requirements: In this layer, these five functional managers will collaborate with the DW professionals to understand the business problem and to define and document the required data quality factors for the DW project. By doing this, the data quality issues that they have recognized will be minimized, if not completely eliminated.

4) Develop: In this layer, the DW professionals will develop and test the DW solution, keeping in mind the DQ factors defined by these five functional managers or their representatives in the requirements phase.

5) Implement: In this layer, the DQ solution will be implemented after it has been duly signed off by the quality assurance team.

6) Measure: In this layer, data sampling is done and a measure of current process capability is worked out on the DQ factors defined in the requirements phase. This activity helps to minimize data quality problems.

Adhering to the above process should minimize, if not totally eliminate, the data quality issues.

6. Conclusions

Experience suggests that one solution does not fit all; rather, DQ assessment is an on-going effort that requires awareness of the fundamental principles underlying the development of subjective and objective DQ metrics. In this article, the authors have presented an approach that combines the subjective and objective assessments of DQ, and demonstrated how the approach can be defined effectively in practice.

The goal of any DW and DQ program is to provide decision makers with clean, consistent, and relevant data. Data warehouses should provide a “single version of the truth” of high-quality data; this enables employees to make informed and better decisions, while low-quality data has a severe effect on organizational performance.
A high-quality data warehouse increases the trust in and reliability of data used by applications such as data mining and its associated data-reduction techniques. In addition, the trends identified in the DW can be used to ensure optimal inventory levels, to support high-quality Website design, and to detect possible fraudulent behavior. This, in turn, should lead to improved customer satisfaction and an increase in market share.

With support and commitment from top-level management, and by employing the data quality model and strategy proposed in this paper, the authors are confident that effective data quality can be achieved in a DW environment.

7. References

[1] Ambler, S. W. (2006), “Data quality survey results”, www.ambysoft.com/downloads/surveys/DataQuality200609.ppt, accessed on Oct 27, 2008.

[2] Andrea, R., & Miriam, C. (2005), “Invisible data quality issues in a CRM implementation”, Journal of Database Marketing & Customer Strategy Management, Vol. 12, No. 4, pp. 305-314.

[3] Ballou, D., & Tayi, G. (1999, January), “Enhancing data quality in data warehouse environments”, Communications of the ACM, Vol. 42, No. 1, pp. 73-78.

[4] Bielski, L. (2005), “Taking notice of data quality: as DQ discipline goes enterprise-wide, even the "C suite" is getting involved”, Banking Journal, Vol. 97, No. 12, pp. 41-46.

[5] Bryan, F. (2002), “Managing the quality and completeness of customer data”, Journal of Database Management, Vol. 10, No. 2, pp. 139-158.

[6] English, L. P. (2001), “Information quality management: the next frontier”, Annual Quality Congress Proceedings, American Society for Quality, Milwaukee, WI, pp. 529-33.

[7] Friedman, Nelson, and Radcliffe (2004), “CRM demands data cleansing”, Gartner Research, December 2004.

[8] Guimaraes, T., Staples, D. S., & McKeen, J. D. (2007), “Assessing the Impact from Information Systems Quality”, Quality Management Journal, Vol. 14, No. 1, pp. 30-44.

[9] Iain, H., & Don, M. (2005), “Prioritizing and deploying data quality improvement activity”, Journal of Database Marketing & Customer Strategy Management, Vol. 12, No. 2, pp. 113

[10] Nelson, R., Todd, P., & Wixom, B. (2005), “Antecedents of Information and System Quality: An Empirical Examination Within the Context of Data Warehousing”, Journal of Management Information Systems, Vol. 21, No. 4, pp. 199-235.

[11] Nord, G. D. (2005), “An Investigation of the Impact of Organization Size on Data Quality Issues”, Journal of Database Management, Vol. 16, No. 3, pp. 58-71.

[12] Orr, O. (1996), “Data Quality and Systems Theory”, Communications of the ACM, Vol. 41, No. 2, pp. 66-71.

[13] Payton, F. C. & Zahay, D. (2005), “Why doesn't marketing use the corporate data warehouse? The role of trust and quality in adoption of data-warehousing technology for CRM applications”, The Journal of Business & Industrial Marketing, Vol. 20, No. 4/5, pp. 237-244.

[14] Theodoratos, D., & Bouzeghoub, M. (2001), “Data Currency Quality Satisfaction in the Design of a Data Warehouse”, International Journal of Cooperative Information Systems, Vol. 10, No. 3, p. 299

[15] Vikram, R., & Sreedhar, S. (2006), “Data Quality for Enterprise Risk Management”, Business Intelligence Journal, Vol. 11, No. 2, pp. 18-20.

[16] Xu, H., Nord, J. H., Brown, N., Nord, G. D. (2002), “Data quality issues in implementing an ERP”, Industrial Management & Data Systems, Vol. 102, No. 1, pp. 47-60.
