
Data quality: A survey of data quality dimensions

Article · August 2013


DOI: 10.1109/InfRKM.2012.6204995



Data Quality: A Survey of Data Quality Dimensions

¹Fatimah Sidi, ²Payam Hassany Shariat Panahy, ¹Lilly Suriani Affendey, ¹Marzanah A. Jabar, ¹Hamidah Ibrahim, ¹Aida Mustapha

Faculty of Computer Science and Information Technology
Universiti Putra Malaysia
¹{fatimacd, suriani, marzanah, hamidah, aida}@fsktm.upm.edu.my
²Payam_shp49@yahoo.com

Abstract- Nowadays, activities and decision making in an organization are based on data and on information obtained from data analysis, which provides various services for constructing reliable and accurate processes. As data are significant resources in all organizations, the quality of data is critical for managers and operating processes to identify related performance issues. Moreover, high-quality data can increase the opportunity for achieving top services in an organization. However, identifying the various aspects of data quality, including definitions, dimensions, types, strategies and techniques, is essential to equip methods and processes for improving data. This paper focuses on a systematic review of data quality dimensions for use in a proposed framework that combines data mining and statistical techniques to measure dependencies among dimensions and to illustrate how extracted knowledge can increase process quality.

Keywords- Data Quality, Data Quality Dimensions, Types of Data

I. INTRODUCTION
To support an organisation's activities, the process activity should be designed appropriately, since it involves data. Data is the primary foundation of operational, tactical and decision-making activities. As data are crucial resources in all organizations and in business and governmental applications, the quality of data is critical for managers and operating processes to identify related performance issues [1], [2], [3]. There is a variety of data quality issues, covering definition, measurement, analysis and improvement, which are essential for ensuring high data quality [4]. As various studies show, if the quality of the process and of the information inputs is not controlled, the degradation of data quality becomes obvious after a while [2].
To improve process quality with enhanced efficiency in production and administration, process design is necessary for automation and management technology. It is well known that both provide services to businesses and individual users quickly and consistently [5]. Although a large variety of techniques for assessing and improving data quality is available, such as business rules, record linkage and similarity measures, the growing difficulty and multiplicity of using these systems has led to data quality methodologies being defined [2].
Data quality can provide various services for an organisation, and high-quality data can increase the opportunity to achieve top services in an organization. Likewise, a lack of data quality in organizations can be multiplied in a Cooperative Information System (CIS). A CIS is an information system able to distribute and share common objectives among the interconnected systems of various independent organizations in different geographical areas, with data as its basic resource [6].
Some researchers have identified and tested factors affecting data quality inside an organization by collecting data from surveys and interviews with senior managers. Their results show that management responsibilities, such as a commitment to improving data quality continually, effective communication among stakeholders and an understanding of data quality, are significant elements influencing data quality in an organization [1].
The rest of this paper discusses data quality strategies and techniques, types of data, data quality definitions, the classification of data quality problems and data quality dimensions, in order to present the fundamental issues in this field.

II. DATA QUALITY STRATEGIES AND TECHNIQUES
Two types of strategies can be adopted for improving data quality, namely data-driven and process-driven, and each strategy employs various techniques [2]. In both cases, the aim of each technique is to improve the quality of data.

A. Data-Driven
Data-driven is a strategy for improving the quality of data by modifying the data values directly. Related data-driven improvement techniques are: acquisition of new data, standardization or normalization, error localization and correction, record linkage, data and schema integration, source trustworthiness, and cost optimization [2].

B. Process-Driven
Process-driven is another strategy, which redesigns the processes that produce or modify data in order to improve its quality. The process-driven strategy consists of two main techniques: process control and process redesign. In process control, data are checked and managed during the manufacturing process, while in process redesign the causes of low quality are eliminated and

978-1-4673-1090-1/12/$31.00 ©2012 IEEE 300


new processes are added in order to produce high-quality data. Furthermore, adding an activity that controls the format of data before storage is another aspect of process redesign [2].
However, process-driven techniques perform better than data-driven techniques in the long term, because they remove the root causes of the quality problems completely. In contrast, the data-driven strategy is more expensive than the process-driven strategy in the long term, but it is efficient in the short term [2].

III. TYPES OF DATA
Data represent real-world objects in a form that can be stored, retrieved and elaborated through a software process and communicated via a network [7]. Researchers have provided different classifications of data in different areas. Three types of data are described, implicitly or explicitly, in the field of data quality (DQ) [2]. Table I presents the types of data based on this classification.

A second classification of data is based on considering data as a product; this model classifies data into three types. Table II shows this classification.

TABLE I. TYPES OF DATA DESCRIBED IMPLICITLY OR EXPLICITLY

| Types of Data | Definition | Example |
|---|---|---|
| Structured data | Generalization or aggregation of items described by elementary attributes defined within a domain. | Relational tables, statistical data |
| Unstructured data | A generic sequence of symbols, typically coded in natural language. | Body of an email, questionnaire with free-text answers |
| Semi-structured data | Data that have a structure with some degree of flexibility. | Markup language, XML |

TABLE II. TYPES OF DATA WHEN DATA ARE SEEN AS A PRODUCT

| Types of Data | Definition |
|---|---|
| Raw data items | Smaller data units which are used to create information and component data items. |
| Component data items | Data constructed from raw data items and stored temporarily until the final product is manufactured. |
| Information products | Data which are the result of performing a manufacturing activity on data. |

Another classification of data is based on the strictness with which data quality is measured and achieved, and has two classes: elementary data and aggregated data. In an organization, data which are managed by operational processes and represent atomic phenomena of the real world are called elementary data (e.g., sex, age), while data which are obtained by applying an aggregation function to elementary data are called aggregated data (e.g., the average income tax paid by taxpayers in a specific city) [7]. From another point of view, data can be classified into different types based on their usage in a variety of fields (e.g., network or web).

IV. DATA QUALITY DEFINITIONS
Data quality has different definitions in different fields and periods, and researchers and experts have understood it differently. According to quality management, data quality is appropriateness for use, or the quality of data to meet user or customer needs [9]. Another definition of data quality is fitness for use. Indeed, the quality of data is critical for improving process activity, and it is addressed in different fields including management, medicine, statistics and computer science. The wide collection of definitions of data quality may give an opportunity to better understand the nature of data processes.

V. DATA QUALITY PROBLEMS CLASSIFICATION
Data quality problems can generally be divided into two classes: single-source and multi-source problems. According to some research, four categories of data quality problems are identified, as shown in Table III. The goal of classifying data quality problems is to expose non-standard data and to identify the exact application of data for the corresponding requirements [10].

TABLE III. DATA QUALITY PROBLEMS CLASSIFICATION

| Data quality problem | Category | Definition |
|---|---|---|
| Single-source problems | Schema level | Lack of integrity constraints, poor schema design: uniqueness constraints, referential integrity |
| Single-source problems | Instance level | Data entry errors: misspelling, redundancy/duplicates, contradictory values |
| Multi-source problems | Schema level | Heterogeneous data models and schema design: naming conflicts |
| Multi-source problems | Instance level | Overlapping, contradicting and inconsistent data: inconsistent aggregating, inconsistent timing |

VI. DATA QUALITY DIMENSIONS AND DEFINITION

Table IV illustrates some data quality dimensions and their definitions from the literature. From the research perspective, there are various numbers of dimensions for information quality (IQ) and data quality. In fact, “Data Quality”, “Information Systems” and “Accounting and Auditing” are three initial categories for identifying proper DQ dimensions [11]. In the field of data quality, Wang [11] determined four categories, namely Intrinsic DQ, Accessibility DQ, Contextual DQ and Representational DQ, and fifteen dimensions for DQ/IQ (e.g., objectivity, believability, reputation, value added). Other researchers recognized extra dimensions for DQ such as data validation, credibility, traceability and availability. In the area of information systems, researchers identified different factors such as reliability, precision, relevancy, usability and independency. In accounting and auditing, researchers explained that accuracy, timeliness and relevance are three data quality dimensions. In addition, some scholars in this area explained that internal control systems need the lowest cost and the highest reliability, which refers to dimensions such as accuracy, frequency and size of data [12].

Based on the ISO standard, quality means the totality of the characteristics of an entity that bear on its ability to satisfy stated and implied needs [13].
A data quality dimension is a characteristic or part of information used for classifying information and data requirements. In fact, it offers a way of measuring and managing the quality of data as well as information [14].
Understanding data quality dimensions is therefore the primary step towards improving data quality. Analysts and developers use dimensions and taxonomies of data, together with data quality tools, for creating and manipulating information in order to improve information and its processes.

TABLE IV. DATA QUALITY DIMENSIONS

| Dimension | Definition |
|---|---|
| Timeliness | The extent to which the age of the data is appropriate for the task at hand [15]. |
| | Timeliness refers only to the delay between a change of a real-world state and the resulting modification of the information system state [2, 11]. |
| | Timeliness has two components: age and volatility. Age or currency is a measure of how old the information is, based on how long ago it was recorded. Volatility is a measure of information instability, the frequency of change of the value for an entity attribute [2, 16]. |
| Currency | Currency is the degree to which a datum is up-to-date. A datum value is up-to-date if it is correct in spite of possible discrepancies caused by time-related changes to the correct value [2, 17]. |
| | Currency describes when the information was entered in the sources and/or the data warehouse. Volatility describes the time period for which information is valid in the real world [2, 18]. |
| Consistency | The extent to which data is presented in the same format and compatible with previous data [15]. |
| | Refers to the violation of semantic rules defined over the set of data [2]. |
| Accuracy | Data are accurate when data values stored in the database correspond to real-world values [2, 19]. |
| | The extent to which data is correct, reliable and certified [15]. |
| | Accuracy is a measure of the proximity of a data value, v, to some other value, v’, that is considered correct [2, 17]. |
| | A measure of the correctness of the data (which requires an authoritative source of reference to be identified and accessible) [14]. |
| Completeness | The ability of an information system to represent every meaningful state of the represented real-world system [2, 11]. |
| | The extent to which data are of sufficient breadth, depth and scope for the task at hand [15]. |
| | The degree to which values are present in a data collection [2, 17]. |
| | Percentage of the real-world information entered in the sources and/or the data warehouse [2, 18]. |
| | Information having all required parts of an entity’s information present [2, 16]. |
| | Ratio between the number of non-null values in a source and the size of the universal relation [2, 20]. |
| | All values that are supposed to be collected as per a collection theory [2, 21]. |
| Accessibility | Extent to which information is available, or easily and quickly retrievable [15]. |
| Duplication | A measure of unwanted duplication existing within or across systems for a particular field, record, or data set [14]. |
| Data specification | A measure of the existence, completeness, quality and documentation of data standards, data models, business rules, metadata and reference data [14]. |
| Presentation Quality | A measure of how information is presented to and collected from those who utilize it. Format and appearance support appropriate use of information [14]. |
| Consistent Representation | The extent to which data is presented in the same format [22]. |
| Reputation | The extent to which information is highly regarded in terms of source or content [15]. |
| Safety | It is the capability of the function to achieve acceptable levels of risk of harm to people, process, property or the environment [13]. |
| Appropriate amount of data | The extent to which the volume of data is appropriate for the task at hand [22]. |

| Dimension | Definition |
|---|---|
| Security | Extent to which access to information is restricted appropriately to maintain its security [15]. |
| Believability | Extent to which information is regarded as true and credible [15]. |
| Understandability | Extent to which data are clear without ambiguity and easily comprehended [15]. |
| | The extent to which data is easily comprehended [22]. |
| Objectivity | Extent to which information is unbiased, unprejudiced and impartial [15]. |
| Relevancy | Extent to which information is applicable and helpful for the task at hand [15]. |
| Effectiveness | It is the capability of the function to enable users to achieve specified goals with accuracy and completeness in a specified context of use [2]. |
| Interpretability | The extent to which data is in appropriate languages, symbols and units, and the definitions are clear [22]. |
| Ease of Manipulation | The extent to which data is easy to manipulate and apply to different tasks [22]. |
| Free-of-error | The extent to which data is correct and reliable [22]. |
| Ease of Use and Maintainability | A measure of the degree to which data can be accessed and used and the degree to which data can be updated, maintained, and managed [14]. |
| Useability | The extent to which information is clear and easily used [23]. |
| Reliability | Extent to which information is correct and reliable [15]. |
| | It is the capability of the function to maintain a specified level of performance when used under specified conditions [13]. |
| Amount of data | The extent to which the quantity or volume of available data is appropriate [15]. |
| Freshness | Freshness represents a family of quality factors, each one representing some freshness aspect and having its own metrics [24]. |
| Value added | The extent to which information is beneficial and provides advantages from its use [15]. |
| Learnability | It means the capability of the function to enable the user to learn it [13]. |
| Data Decay | A measure of the rate of negative change to data [14]. |
| Concise | Extent to which information is compactly represented without being overwhelming (i.e., brief in presentation, yet complete and to the point) [15]. |
| Consistency and Synchronization | A measure of the equivalence of information used in various data stores, applications, and systems, and the processes for making data equivalent [14]. |
| Data integrity fundamentals | A measure of the existence, validity, structure, content, and other basic characteristics of the data [14]. |
| Navigation | Extent to which data are easily found and linked to [23]. |
| Useful | Extent to which information is applicable and helpful for the task at hand [15]. |
| Efficiency | Extent to which data are able to quickly meet the information needs for the task at hand [15]. |
| Availability | Extent to which information is physically accessible [23]. |
| Data Coverage | A measure of the availability and comprehensiveness of data compared to the total data universe or population of interest [14]. |
| Transactability | A measure of the degree to which data will produce the desired business transaction or outcome [14]. |
| Timeliness and Availability | A measure of the degree to which data are current and available for use as specified and in the time frame in which they are expected [14]. |

VII. DISCUSSION

Most people think that the quality of data depends only on its accuracy, and they do not consider and analyze other significant dimensions for achieving higher quality. Indeed, the quality of data involves more than one dimension, so the issue of dependencies among dimensions is essential for improving process quality in different domains and applications. Without knowing the existing relations between data quality dimensions, knowledge discovery cannot be effective and comprehensive for the decision-making process. Previous work found that not only can dimensions be strongly related to each other, but data quality can also be supported via the effective dependencies [25]. In fact, selecting appropriate dimensions and identifying the correlations among them can create high-quality data. In order to discover dependencies among the more commonly referenced dimensions, namely accuracy, currency, consistency and completeness, we propose a framework that combines data mining and statistical techniques to measure dependencies among dimensions and to illustrate how extracted knowledge can increase process quality. Our hypothesis is that if there is a correlation between the completeness, consistency and accuracy dimensions, which are considered independent variables, and currency, which is considered the dependent variable, then data quality will improve. Also, because of some difficulties with the currency dimension, a policy is required.

Fig. 1 illustrates the proposed framework for evaluating the effect of independent dimensions on dependent dimensions.
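The idea of measuring a dependency between an independent dimension and the dependent currency dimension can be sketched in a few lines of Python. The paper does not state which statistical technique the framework uses, so the Pearson correlation coefficient below is an assumption, as are the helper names `completeness` and `pearson` and the sample scores. The completeness score is a simplified reading of the non-null ratio definition in Table IV.

```python
from math import sqrt


def completeness(rows, fields):
    """Share of non-null cells: non-null values / (number of rows * number of fields)."""
    total = len(rows) * len(fields)
    filled = sum(1 for r in rows for f in fields if r.get(f) is not None)
    return filled / total if total else 0.0


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if dev_x == 0 or dev_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (dev_x * dev_y)


# Hypothetical table with one missing value, invented for illustration.
rows = [{"name": "Ann", "age": None}, {"name": "Bob", "age": 41}]
score = completeness(rows, ["name", "age"])

# Hypothetical per-data-set scores in [0, 1]; real scores would come from measurement.
completeness_scores = [0.60, 0.70, 0.80, 0.90]
currency_scores = [0.50, 0.60, 0.70, 0.80]
r = pearson(completeness_scores, currency_scores)
```

A coefficient close to 1 or -1 would suggest a strong linear dependency between the two dimensions for these data sets, while a value near 0 would suggest none. Correlation alone does not establish which dimension drives the other.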

FIG. 1 FRAMEWORK
[Figure: Data Quality Dimensions -> Select Independent Variable (Accuracy, Completeness, Consistency) and Select Dependent Variable (Currency) -> Applying Technique -> Data Quality Improvement]

So, the aim of the proposed framework is to discover the dependency structure for the assessed data quality dimensions.

VIII. CONCLUSION

From the research perspective, many scholars have identified various methodologies and frameworks for assessing and improving data quality through different techniques and strategies on the data quality dimensions [2]. They illustrated definitions for dimensions and identified the more important data quality dimensions [2], [11], [12], [22]. The existing surveys identified forty data quality dimensions from 1985 to 2009. Since some dimensions such as timeliness, currency, accuracy and completeness are referenced more than others, the results of this survey will be used to find correlations among data quality dimensions, based on the proposed framework that combines data mining and statistical techniques, for measuring dependencies among them and illustrating how process quality can be increased via the extracted knowledge. Specifically, our future work will be to evaluate the dependencies among the mentioned data quality dimensions for improving process quality.

REFERENCES

[1] S. W. Tee, P. L. Bowen, P. Doyle, F. H. Rohde, "Factors influencing organizations to improve data quality in their information systems," Accounting & Finance, vol. 47, pp. 335-355, 2007.
[2] C. Batini, C. Cappiello, C. Francalanci, A. Maurino, "Methodologies for data quality assessment and improvement," ACM Computing Surveys (CSUR), vol. 41, p. 16, 2009.
[3] W. Eckerson, "Data Warehousing Special Report: Data quality and the bottom line," Applications Development Trends May, 2002.
[4] Y. Y. R. Wang, R. Y. Wang, M. Ziad, Y. W. Lee, Data quality vol. 23: Springer, 2001.
[5] F. Casati, M. C. Shan, M. Sayal, "Investigating business processes," ed: Google Patents, 2009.
[6] M. Mecella, M. Scannapieco, A. Virgillito, R. Baldoni, T. Catarci, C. Batini, "Managing data quality in cooperative information systems," On the Move to Meaningful Internet Systems 2002: CoopIS, DOA, and ODBASE, pp. 486-502, 2002.
[7] C. Batini and M. Scannapieco, Data quality: Concepts, methodologies and techniques: Springer-Verlag New York Inc, 2006.
[8] V. Peralta, "Data quality evaluation in data integration systems," Université de Versailles (chair) Raúl RUGGIA Professor, Universidad de la República, Uruguay, 2008.
[9] F. G. Alizamini, M. M. Pedram, M. Alishahi, K. Badie, "Data quality improvement using fuzzy association rules," 2010, pp. V1-468-V1-472.
[10] Y. Man, L. Wei, H. Gang, G. Juntao, "A noval data quality controlling and assessing model based on rules," 2010, pp. 29-32.
[11] Y. Wand and R. Y. Wang, "Anchoring data quality dimensions in ontological foundations," Communications of the ACM, vol. 39, pp. 86-95, 1996.
[12] K. Q. Wang, S. R. Tong, L. Roucoules, B. Eynard, "Analysis of data quality and information quality problems in digital manufacturing," 2008, pp. 439-443.
[13] M. Heravizadeh, J. Mendling, M. Rosemann, "Dimensions of business processes quality (QoBP)," 2009, pp. 80-91.
[14] D. McGilvray, Executing data quality projects: Ten steps to quality data and trusted information: Morgan Kaufmann, 2008.
[15] R. Y. Wang and D. M. Strong, "Beyond accuracy: What data quality means to data consumers," Journal of management information systems, vol. 12, pp. 5-33, 1996.
[16] M. Bovee, R. P. Srivastava, B. Mak, "A conceptual framework and belief-function approach to assessing overall information quality," International journal of intelligent systems, vol. 18, pp. 51-74, 2003.
[17] T. C. Redman, Data quality for the information age: Artech House, 1996.
[18] M. Jarke, Fundamentals of data warehouses: Springer Verlag, 2003.
[19] D. P. Ballou and H. L. Pazer, "Modeling data and process quality in multi-input, multi-output information systems," Management science, pp. 150-162, 1985.
[20] F. Naumann, Quality-driven query answering for integrated information systems vol. 2261: Springer Verlag, 2002.
[21] L. Liu and L. Chi, "Evolutionary data quality," 2002.
[22] L. L. Pipino, Y. W. Lee, R. Y. Wang, "Data quality assessment," Communications of the ACM, vol. 45, pp. 211-218, 2002.
[23] S. Knight and J. Burn, "Developing a framework for assessing information quality on the world wide web," Informing Science: International Journal of an Emerging Transdiscipline, vol. 8, pp. 159-172, 2005.
[24] V. Peralta, "Data quality evaluation in data integration systems," Université de Versailles (chair) Raúl RUGGIA Professor, Universidad de la República, Uruguay, 2008.
[25] D. Barone, et al., "Dependency discovery in data quality," 2010, pp. 53-67.
