Synoptic Analysis of Contemporary Data Cleansing Issues and Causes

Riphah Institute of Computing and Applied Sciences [RICAS], Lahore Campus 1
Table of Contents
ABSTRACT.................................................................................................................................................. 4
Key words: .................................................................................................................................................... 4
1. INTRODUCTION .............................................................................................................................. 4
1.1 DATA QUALITY........................................................................................................................ 5
1.2 DATA CLEANSING ................................................................................................................. 5

2 PROBLEMS FACED DURING DATA CLEANSING .................................................................. 6
2.1 High Volume of Data:...................................................................................................................... 6
2.2 Misspellings: ..................................................................................................................................... 6

2.3 Lexical Errors: ................................................................................................................................... 6
2.4 Misfielded Value: ............................................................................................................................. 7

2.5 Domain Format Errors: ................................................................................................................... 7
2.5 Irregularities ..................................................................................................................................... 7
2.7 Missing values .................................................................................................................................. 7
2 CHALLENGES IN DATA CLEANSING............................................................................................ 8

3 DATA CLEANSING PROCEDURE ................................................................................................... 9
Fig. 1 Data Cleansing Procedure .......................................................................................................... 9
4 STAGES WHERE PROBLEMS ARISE ............................................................................................ 10

5.1 Data Acquisition Process ......................................................................................................... 10
The data is entered into the company's systems in several ways, including manual and
automatic. Data migration in large volumes introduces serious errors. Many factors
contribute to the poor quality of incoming data .......................................................................... 10
5.2 Initial Data Conversion .............................................................................................................. 10
When data is migrated from one existing database to another, many quality issues can
arise. Source data may itself be incorrect because of its own limitations; or mapping the old
database to the new database may have inconsistencies or the conversion routines may
map them incorrectly. We often see that "legacy" systems have metadata that are not
synchronized with the actual schema implemented. The accuracy of the data dictionary
serves as a basis for conversion algorithms, mapping and effort. If the dictionary and actual
data are out of sync, this can lead to major data quality issues................................................. 10
5.3 System Consolidation.................................................................................................................. 11
Mergers and acquisitions of companies lead to consolidation of data. Since the focus is
primarily on streamlining business processes, the combination of data is usually less
important. This can be catastrophic, especially if the data experts from the previous
company are not involved in the consolidation process, and desynchronized metadata is a
problem. Merging two databases that do not have compatible fields can result in data
misassembly and compromise the accuracy of the data. ........................................................... 11
5.4 Manual Data Entry ..................................................................................................................... 11
6 SIGNIFICANCE OF THIS RESEARCH ........................................................................................... 13
6.1 Let Business Drive Data Quality The main goal of Data is to fuel business. Rather than
letting IT keep the reins of data quality, the business units that are the primary users of this
data are better equipped to define the data quality parameters. If business intelligence is
closely tied to the underlying data, there is a better chance of adopting effective
methodologies that focus on strategic priority data. 6.2 Appoint Data Stewards.................... 14
These are roles created specifically to define the owners of data quality. Data managers are
the leaders who control the integrity of data in the system. It is imperative that data
managers be selected from the business units because they understand how the data
translates into the specific needs of their group. By keeping the LOBs responsible for the
data, there is a better chance of generating good quality data at source and in the course of
normal business conduct. 6.3 Formulate A Data Governance Board This group includes
representations of all business functions, data stakeholders and IT. Data managers could be
closely related or be members of this board. The board ensures that similar data quality
approaches and policies are adopted across the organization and across all functions of the
organization. The board meets periodically to define new data quality targets, perform
measurements, and analyze the state of data quality across various business units. 6.4 Build
A Data Quality Firewall ................................................................................................................... 14
7 FRAMEWORK TO ACHIEVE DATA QUALITY ........................................................................... 15

7.4 Improving data quality in operational systems .................................................................. 16
Data shared between data providers and consumers must be subject to contractual
agreements that clearly define acceptable levels of quality. Data metrics can be
incorporated into these contracts in the form of performance SLAs. ....................................... 16
The definition of mutually agreed data standards and data formats facilitates the flow of
data from one company to another. The metadata can be placed under a repository that is
actively managed by the data center, which would ensure that the data is represented in a
way that is enjoyable and beneficial for both collaborating parties. The gap analysis and
alignment of the business needs of both parties is performed by this data center. Data
quality inspections can be done manually or through automated routines to determine
work levels. Workflows can be defined to periodically monitor the data and take corrective
action accordingly, based on the expectations of the SLA targets and specified actions if
these SLAs are not respected. ......................................................................................................... 17
8 CONCLUSION ................................................................................................................................... 17
REFERENCES ........................................................................................................................................... 18
Synoptic Analysis of Contemporary Data Cleansing

Issues and Causes
Imran Ahmed, Shah Nawaz, MSCS,

Riphah Institute of Computing and Applied Sciences [RICAS] Riphah International University,
Lahore sub campus
ABSTRACT
The quality of a data set is judged by many parameters, including accuracy, consistency,
reliability, completeness, usefulness, and timeliness. Low quality data refers to missing, invalid,
irrelevant, outdated or incorrect data. Poor data quality does not just imply that the data has
been incorrectly acquired. There could be many other reasons that could lead to absolutely
valid data at one time for a function, becoming totally false for another company or function.
Data cleaning, also known as data cleansing, is designed to optimize the accuracy and quality of
data. The data cleansing process is based on the modification or deletion of incorrect,
incomplete, incorrectly formatted or duplicated data. Data cleansing can use analysis or other
methods to get rid of syntax errors, typographical errors, or fragments of records. In this
research, some of the modern problems encountered when cleaning up the data and how these
problems can be solved are listed in detail.
Key words:
Data mining, data cleansing, data scrubbing, problems of data cleansing, challenges of data
cleansing
1. INTRODUCTION
Data cleaning is a valuable process that can help businesses save time and increase efficiency.
Data cleansing frameworks are used by various organizations to delete duplicate data, correct
the incorrect data, fix and amend data with undesired format, and modify incomplete data in
marketing lists, and databases. Businesses can save not only time but money by adopting
suitable data cleansing techniques. Data cleansing is especially important for organizations that
have vast amounts of data to process. These organizations may include banks or government
organizations. In fact, many sources suggest that any company that uses and holds data should
invest in cleansing methodologies. Such techniques and methodologies should also be used
regularly as inaccurate data levels can increase rapidly. In this section we will look in detail
various aspects of data cleansing:
1.1 DATA QUALITY

Taking care of data quality is the key to safeguarding it and improving it from a quality point of
view. Ensuring that the data is of acceptable quality is a daunting task. This observation is the
most important in this age of information technology, when special attention to data is of great
importance if the information is not to become false information. Data can be called high quality
if they are "fit for purpose" in all their intended roles [11].
Data quality is an increasingly common problem for organizations trying to derive valuable
information from the data. Adequate data quality is important for various processes, as well as
the durability of BI (Business Intelligence) and Business Analytics (BA) reports. Data quality is
primarily affected by how data is entered, cleaned, processed, maintained, and analyzed. Some
Data Quality Assurance (DQA) procedures are used to verify the effectiveness and reliability of
the data [11].
Data quality is of pivotal importance for:
 Provision of precise information to handle accountability and services.
 Offering updated information to deal with the services effectively.
 Helping in prioritizing and guaranteeing effective resource utilization.
Adequate data quality maintenance requires planned data monitoring and cleansing. The process
of data quality maintenance involves updation, standardization of data and removal of
duplicating records to create a single data view.
Following are the key data quality components which are to be followed:
 Completeness: Level at which needed data characteristics are provided.

 Accuracy: The degree to which the result of a measurement and calculation meets to the
any specified standard.
 Credibility: Degree to which data is considered true.
 Timeliness: Level to which data is suitably updated.
 Consistency: Consistency assesses whether different dataset facts match.
 Integrity: Integrity is the overall accuracy, completeness and consistency of various
datasets.
1.2 DATA CLEANSING
The major areas that include data cleansing as part of their defining processes are data
warehousing, knowledge discovery in databases, and data information quality management (e.g.,
Total Data Quality Management TDQM). Data cleansing is defined in several (but similar) ways.
In data cleansing is the process of eliminating the errors and the inconsistencies in data and
solving the object identity problem [12].
Key areas that include data cleansing as part of their defining processes are data
warehousing, database discovery, and data quality management (Total Data Quality Management
TDQM). Data cleansing is defined in several ways (but similar). In data cleaning is the process
of eliminating errors and inconsistencies in the data and solving the identity problem of the
object
2 PROBLEMS FACED DURING DATA CLEANSING
Researchers are trying to tackle a number of problems in the data cleansing process. Dirty data is
of particular interest in the context of research. Organizations need to understand the various data
cleansing issues and how to solve them [3]. The need for data cleansing increases dramatically as
multiple data sources are integrated. This process of making data accurate and consistent is
fraught with many problems, some of which are mentioned below:
2.1 High Volume of Data:

Applications such as data warehouses continually load huge amounts of data from a variety of
sources and also carry a significant amount of corrupted data (data errors). In this case, the task
of cleansing the data becomes both significant and daunting.
2.2 Misspellings:
The spelling mistakes occur mainly because of a typo. The wrong spelling can be detected and
corrected for common words and grammatical errors, however, as the database constrains a
huge amount of data that is unique, it is difficult to detect the spelling error at the input level. In
addition, spelling mistakes in data such as names, addresses are always difficult to identify and
correct [5].
2.3 Lexical Errors:

Lexical errors occur in the data due to name differences between the structure of the data
elements and the specified format. Example: A particular database record attribute for name,
age, gender, and size. When an individual does not indicate an intermediate value, say (age)
that the data for the following attributes change their field. In the case above, when the
individual does not indicate the value for the age, the value for the sex, let us say that the
masculine is read as the age and the value of the height is read as the sex.
2.4 Misfielded Value:

Misfielded value problem occurs when the values entered are correct for the remote format but
do not belong to the field. Example in the area of the city, the registered value is Germany.
2.5 Domain Format Errors:

Domain format errors occur when the value of a particular attribute is correct but does not
match the domain format. For example, a particular NAME database requires that the first and
last names be separated by a comma but the entry is without a comma. In this case, the entry
may be correct but it does not conform to the domain format.
2.5 Irregularities
The irregularities concern the non-uniform use of units or values. For example, by entering the
salary of the employee, the salary is mentioned using different currencies. This type of data
requires subjective interpretation and can often lead to erroneous results.
2.7 Missing values

Missing values result from omissions that occur during data collection. They signify the
unavailability of values during the data entry process [7]. The dummy values and NULL values
are included in the missing values. For example, 000-0000 and 999-9999 in the phone number
field.
2.8 Contradiction
A contradiction error occurs when the same real-world entity is described by two different
values in the data. Example in a personal database for the same person there are two records
with two different birth dates, however, the other values and the entity are identical.
2.9 Duplication
The duplication problem means that the same data is represented more than once because of an
input error. For example, there may be two records of the same person with the same minimal
difference in the name without using the middle name in any of the entries. No data is wrong
but the person is represented twice for not having checked the duplicity.
2.10 Violations of Integrity Constraints or Illegal Values
Integrity constraints violations describe values that do not satisfy integrity value constraints.
This occurs when the input value is outside the allowed value limits to represent a particular
attribute.
2.11 Cryptic Values and Abbreviations:
This includes the use of encrypted values and abbreviations in the fields. Example instead of the
full name of the university using only initials. This type of error increases the chances of
duplication and reduces sorting capacity.

2.12 Attribute dependencies violated:
These errors when the value for a secondary does not match the main attribute. Example when
the city listed is not in the country or when the postal code does not coincide with the city
mentioned [10].
2.13 Wrong references:
Errors related to a poor result prevent data validation and result in data mismatch. For example
in the department field if a person enters a wrong value from the reference department. The
subsequent data validation process results in a discrepancy.
2.14 Integrated values:
This type of error occurs when multiple values are entered in the same field. This practice
severely limits the ability of indexing and sorting capabilities. For example, the values for name,
age, and gender are entered in the name field itself.
Data cleaning is an integral part of data management. It is necessary to make data

accurate, consistent and avoid duplication of information. The article highlights common
problems encountered in data cleansing and aims to serve as a guide for improving data quality
and data cleansing process. Each of the above problems can be easily avoided if the proper
procedures are followed during the cleaning design and execution task [8]. Outsourcing the
data cleaning task to an expert by providing data cleansing outsourcing services can
significantly speed up the task and ensure that your data gets and stays clean.
3 CHALLENGES IN DATA CLEANSING
3.1 Error correction and loss of information

The most difficult problem in data cleansing is the correction of values to remove duplicate
entries and invalid entries. In many cases, the information available about these anomalies is
limited and insufficient to determine the necessary transformations or corrections, leaving the
deletion of these entries as the primary solution. Deletion of data, however, results in loss of
information; this loss can be particularly costly if there is a large amount of data deleted [7].
3.2 Maintenance of cleansed data
Data cleaning is an expensive and time-consuming process. Thus, after cleaning up the data and
obtaining an error-free data collection, it would be best to avoid complete data cleanup after
changing some values in the data collection. The process should be repeated only on values that
have changed; this means that a cleansing line should be maintained, which would require
efficient data collection and management techniques.
3.3 Data cleansing in virtually integrated environments:
In quasi-integrated sources such as IBM's Discovery Link, data cleaning must be performed
each time data is accessed, which significantly increases response time and reduces efficiency
[6].
3.4 Framework for error detection:
In many cases, it will not be possible to derive a complete data cleansing chart to guide the
process in advance. This makes data cleansing an iterative process involving extensive
exploration and interaction, which may require a framework in the form of a collection of
methods for error detection and elimination in addition to data auditing. This can be integrated
with other data processing steps such as integration and maintenance.
2 DATA CLEANSING PROCEDURE
Data
Auditing
Use
Repeat Multiple
Methods
Data
Cleansing
Consolida
Feedback
te Data
Fig. 1 Data Cleansing Procedure
4.1 Data Auditing

The first step towards data cleansing is the complete audit of all customer databases. The audit
should be done using statistical and database methods to detect anomalies and inaccuracies.
The information must be used to infer the characteristics and location of the anomalies, which
can lead to the root cause of the problem.
4.2 Use Multiple Methods
The process of auditing a database should not be limited to analysis by statistical or database
methods, and additional steps such as purchasing external data and comparing with internal
data can be used. In addition, if an organization has time and staff constraints, it can use the
services of an external telemarketing company. However, in this approach, the organization
must be careful about its brand image and the way the external company works.
4.3 Consolidate Data

The process of cleaning up the database should not be limited to identifying and deleting
erroneous (inaccurate) data from the client database. It should be used as an opportunity to
consolidate customer data and additional information such as e-mail addresses, phone numbers
or other contacts should be incorporated wherever possible.
4.4 Feedback
The organization must put in place a control mechanism in which any inaccurate information is
reported and updated in the database. For example, there should be a control and feedback
mechanism for emails and any email that is not delivered due to an incorrect address, should be
reported and the invalid email address cleaned of customer data. ].
4.5 Repeat
People's lives are becoming more dynamic and the associated details such as addresses, phone
number, and e-mail address of the company change frequently. Thus, the data cleansing process
should not be considered a one-time process; Instead, it should be part of the regular workflow.
Regular weeding of inaccurate information and updating of the customer database is the only
way to ensure a proper customer database.
Data cleansing is a difficult but essential process and requires dedication of time and
resources. The procedures mentioned above would certainly help in creating a customer-
specific database that offers multiple benefits across functions and serves as a critical factor in
business growth. Therefore, companies should make investments in data cleansing and data
management a top priority.
Data is at the heart of any business function. The main processes and business decisions
depend on the relevant data. However, in today's competitive and technological world, the
existence of data does not only guarantee the proper functioning of the corresponding business
functions. The quality of the underlying data is of paramount importance to ensure correct
decisions.
3 STAGES WHERE PROBLEMS ARISE
5.1 Data Acquisition Process

The data is entered into the company's systems in several ways, including manual and
automatic. Data migration in large volumes introduces serious errors. Many factors contribute
to the poor quality of incoming data
5.2 Initial Data Conversion

When data is migrated from one existing database to another, many quality issues can arise.
Source data may itself be incorrect because of its own limitations; or mapping the old database
to the new database may have inconsistencies or the conversion routines may map them
incorrectly. We often see that "legacy" systems have metadata that are not synchronized with
the actual schema implemented. The accuracy of the data dictionary serves as a basis for
conversion algorithms, mapping and effort. If the dictionary and actual data are out of sync, this
can lead to major data quality issues.
5.3 System Consolidation
Mergers and acquisitions of companies lead to consolidation of data. Since the focus is
primarily on streamlining business processes, the combination of data is usually less important.
This can be catastrophic, especially if the data experts from the previous company are not
involved in the consolidation process, and desynchronized metadata is a problem. Merging two
databases that do not have compatible fields can result in data misassembly and compromise
the accuracy of the data.
5.4 Manual Data Entry

The data is entered manually into the system several times and is therefore subject to human
error. Because user data is often entered through a variety of user-friendly interfaces, it may not
be directly compatible with internal data representation. In addition, end-users tend to provide
"shortened" information in areas they consider unimportant, but which may be crucial for
internal data management. The data operator may not have the expertise to understand this
data and may incorrectly fill values in erroneous fields or may distort information.
5.5 Batch Feeds
Often, automated processes are used to fill large volumes of similar data in batches, saving time
and energy. Systems that push this mass of data can also inject large amounts of erroneous data.
This can be very disastrous, especially when data travels through a number of serial databases.
Wrong data can trigger incorrect processes that can lead to incorrect decisions that can have
significant negative consequences for the company [9]. Usually, the flow of data through the
integration systems is not fully tested, and any software process upgrades in the data chain
with inadequate regression testing can have a large adverse effect on the data. live.
5.6 Real Time Interfaces:

This is completely opposite to the batch streams. With real-time interfaces and applications
becoming the flavor of the interactive and enhanced user experience, data enters the database in
real time and often propagates rapidly to the interconnected database chain [9]. This almost
immediately triggers actions and responses that may be visible to the user, leaving little room
for validation and verification. This causes a huge hole in data quality assurance where a bad
input can wreak havoc on the back.
5.7 Internal data changes:
The company can run processes that modify the data residing in the system. This can lead to the
inadvertent introduction of errors in the data. The following processes are responsible for
internal changes to business data
5.8 Data Processing
The data of a company must be treated regularly for the summaries, the calculations and the
compensation. There may have been a proven and proven cycle of data processing of this type
in the past. The code of collation programs, the processes themselves, as well as the actual data
evolve over time; therefore, a repeated snack cycle may not give similar results. The processed
data can be completely shifted and if they form the basis of other successive processing, the
error can be reduced in several ways.
5.9 Data Cleansing:
Each company must periodically correct its incorrect data. Manual cleaning has been supported
by the time and effort of saving automation. Although this is very useful, it may incorrectly
affect thousands of records. The software used to automate may have bugs, or the data
specifications that form the basis of the cleaning algorithms may be incorrect. This can result in
absolutely valid, invalid data, and virtually reverse the very benefit of the cleansing exercise [4].
5.10 Data Purging:
Old data must be constantly removed from the system to save valuable storage space and
reduce the maintenance effort required to maintain gigantic and obsolete volumes of
information. Purging destroys the data. Therefore, an erroneous or accidental deletion may
affect the quality of the data by chance. As with cleanup, bugs and incorrect data specifications
in the purge software can trigger unwarranted destruction of valuable data. Sometimes valid
data may incorrectly match the purge criteria and be erased.
5.11 Data Manipulation:

The data contained in the company's databases is subject to manipulation due to system
upgrades, redesign of the database and other such exercises. This leads to data quality issues
because the staff involved may not be a data expert and the data specifications may not be
reliable. Such deterioration of data is called data disintegration. Some reasons explain this
problem
5.12 Changes in Data Not Captured:
Data represents real-world objects that can change by themselves over time, and data
representation may not catch up with this change. Thus, these data are automatically aged and
transformed into a meaningless form. Similarly, in the case of interconnected systems, changes
made to a branch are not migrated to the interfacing systems. This can lead to huge
inconsistencies in the data that may appear unfavorably at a later stage, often after the damage
has occurred.
5.13 System Upgrades:
System upgrades are unavoidable and such exercises rely heavily on data specifications for the
representation of expected data. In reality, the data is far from the documented behavior and the
result is often chaotic. A poorly tested system upgrade can cause irreparable damage to data
quality.
5.14 New uses of data
Businesses need to find more revenue-generating uses for existing data, which can open up new
problems. Data intended for one purpose may practically not be suitable for another purpose
and its use for new purposes may lead to misinterpretations and assumptions in the new
domain.
5.15 Loss of expertise
Experts in data and data seal an impenetrable pact; the expert usually has an "eye" for false
data, knows the exceptions well, knows how to extract the relevant data and throw away the
rest. This is due to long years of association with "the legacy". When such experts withdraw,
move on or are abandoned due to a new merger, the new data processing member may not be
aware of these data anomalies that have been rectified before by the experts. As a result, the
incorrect data may be unchecked in a process.
5.16 Automation of internal processes :
As more and more applications with higher levels of automation share huge amounts of data,
users are more exposed to erroneous internal data previously ignored. Companies risk losing
their credibility in the event of such exposure. Automation can not replace the need to validate
information; Intentional and unintentional modification of data by users may also result in data
degradation, which may be beyond the control of the enterprise. In conclusion, the quality of
the data may be lost due to processes that bring data into the system. data and through a
process of data aging and decay where the data themselves may not change over time.
New business models in modern and rapidly evolving scenarios are introducing new
and innovative functions and processes every day. Each automation cycle or redesign of an
existing process generates its own set of unforeseen challenges that affect the quality of the data.
The key to ensuring data quality is a specific study of the flow of data within each process and
the implementation of a regular audit and monitoring mechanism to detect data degradation. A
mixture of automation and manual validation and cleaning by trained data processors is the
need of the hour. Challenges in data quality should be the main goal of any business if it is to
ensure its proper functioning and future growth.
6 SIGNIFICANCE OF THIS RESEARCH
Companies, large or small, struggle to maintain the quality of the ever-increasing data volumes
necessary for smooth operation. Data quality management does not mean just sniffing
periodically and eliminating bad data. It is inherently necessary for businesses to incorporate
data quality into streamlining and process integration. Obsolete or incorrect data can lead to
major errors in business decisions [2].
Many strategies have been adopted by companies for effective management of data quality. A
focused approach to data governance and data management can have significant benefits. A
proactive approach to controlling, monitoring, and controlling data quality is the key, rather
than responding to data failures or dealing with detected data anomalies. Some of the key
strategies are listed below:
Let Business Drive Data Quality
Appoint Data Stewards
Formulate A Data Governance Board
Build A Data Quality Firewall
Fig. 2 Framework to Avoid Pitfall

6.1 Let Business Drive Data Quality
The main goal of Data is to fuel business. Rather than letting IT keep the reins of data quality,
the business units that are the primary users of this data are better equipped to define the data
quality parameters. If business intelligence is closely tied to the underlying data, there is a better
chance of adopting effective methodologies that focus on strategic priority data.
6.2 Appoint Data Stewards
These are roles created specifically to define the owners of data quality. Data managers are the
leaders who control the integrity of data in the system. It is imperative that data managers be
selected from the business units because they understand how the data translates into the
specific needs of their group. By keeping the LOBs responsible for the data, there is a better
chance of generating good quality data at source and in the course of normal business conduct.
6.3 Formulate A Data Governance Board
This group includes representations of all business functions, data stakeholders and IT. Data
managers could be closely related or be members of this board. The board ensures that similar
data quality approaches and policies are adopted across the organization and across all
functions of the organization. The board meets periodically to define new data quality targets,
perform measurements, and analyze the state of data quality across various business units.
6.4 Build A Data Quality Firewall
Data within the enterprise is a financial asset, so it makes sense to have checks and balances to
ensure that the data entering the systems is of acceptable quality. In addition, whenever this
data is retrieved or modified, there is a potential risk of losing its "precision". Wrong data can
flow downstream and pollute the following data banks, which has an impact on the business.
Building an intelligent virtual firewall can detect and block erroneous data to the point where it
enters the system. Corrupted data automatically detected by the firewall is either returned to
the original source to be rectified or adjusted before moving into the corporate environment.
7 FRAMEWORK TO ACHIEVE DATA QUALITY
Data quality management is a cyclical process that involves a step-by-step logical

implementation. These quantifiable steps can help standardize solid data management practices
that can be deployed in incremental cycles to incorporate higher levels of data quality
techniques into the enterprise architecture. The best practices are classified in successive phases
listed below:
Phase 1
•Data Quality Assessment
Phase 2
•Data Quality Measurement
Phase 3
•Incorporating Data Quality into the
functions and processes
Phase 4
•Data Quality Improvement in operational
systems
Phase 5
•Inspect cases where Data Quality
standards are not met and taking remedial
actions
Fig. 3 Framework to achieve data quality
7.1 Data Quality Assessment

This means essentially submitting the company's data stores to a detailed inspection, in order to
be able to determine data quality issues in its environment. An independent evaluation focused
on data quality is of paramount importance in identifying how poor quality data impedes
business objectives. It provides a baseline for investing and planning improvements in data
quality and measuring the results of successive improvements.
The evaluation of the data must be guided by an impact analysis of the data on
companies. The criticality of the data must be an important parameter in defining the scope and
priority of the data to be evaluated. This top-down approach can be complemented by the
bottom-up strategy of data profiling assessment, which identifies anomalies in the data, and
then matches those anomalies to the potential impact on business goals. . This correlation
provides a basis for measuring data quality and their relationship to business impact.
This phase must be completed by a formal report that clearly lists the results. The report can be
disseminated to stakeholders, decision-makers and therefore lead to actions to improve data
quality.
7.2 Measurement of data quality
The result of the data evaluation report is used to refine the scope to identify critical data
elements. Attributes and dimensions to measure the quality of these data, to define the units of
measurement and to define the acceptable thresholds for these measures form the basis for the
implementation of improvement processes. Attributes such as completeness, consistency,
timeliness are defined that act as input to deciding which tools and techniques need to be
deployed to achieve the desired quality levels. The validity rules of the data are specified
according to these measures. This can help to press data controls in functions that acquire or
modify data in the data lifecycle [1]. In turn, dashboards and dashboards of data quality can be
defined for each business unit derived from these metrics and their thresholds. These scores can
be captured, stored and periodically updated to monitor improvement.
7.3 Integration of data quality into functions and processes:
The emphasis on feature creation takes precedence over data quality during application
development or system upgrade. The metrics defined above can be used to integrate data
quality targets into the system development lifecycle, integrated as mandatory requirements for
each phase of development. Data quality analysts must identify the data requirements for each
application. A complete walkthrough of the data flow within each application provides insight
into the likely points of insertion for the inspection and control routines. These requirements
must be added to the system's functional requirements for seamless integration into the
development lifecycle, validating the data as it is introduced into the system.
7.4 Improving data quality in operational systems

Data shared between data providers and consumers must be subject to contractual agreements
that clearly define acceptable levels of quality. Data metrics can be incorporated into these
contracts in the form of performance SLAs.
The definition of mutually agreed data standards and data formats facilitates the flow of
data from one company to another. The metadata can be placed under a repository that is
actively managed by the data center, which would ensure that the data is represented in a way
that is enjoyable and beneficial for both collaborating parties. The gap analysis and alignment of
the business needs of both parties is performed by this data center.
Data quality inspections can be done manually or through automated routines to determine
work levels. Workflows can be defined to periodically monitor the data and take corrective
action accordingly, based on the expectations of the SLA targets and specified actions if these
SLAs are not respected.
7.5 Inspect instances where data quality standards are not being met and take corrective
action:
When data are deemed to be below expected levels, corrective actions must be subject to
effective data quality monitoring mechanisms, as well as defect tracking systems in software
development. Reporting data defects and tracking actions can help generate performance
reports. A root cause analysis performed on each reported data error provides direct feedback
to understand the flaws in the business processes.
In addition to the above, proactive cleaning and data correction cycles must be
performed from time to time to identify and detect more data errors that may have been
introduced despite strict quality controls [11].
Data quality can be assured at peak or near-peak levels using effective data management
tools to facilitate and provide a solid framework for the implementation of data quality
measurement, monitoring and reporting. subsequent improvements. The chosen quality
management solution must closely match the unique business objectives of the company. Data
quality objectives and management plans must be shared between producers, consumers,
commercial application developers, developers and sales managers. Data quality, after all, is a
joint responsibility. Setting up high-level data entry processes is essential to ensure this.
8 CONCLUSION
The main reasons for poor data quality may be incorrect spellings when entering data, invalid
data, missing information, etc. It is important that the right data is used, cleaned and analyzed
to make the best decisions possible. Without a doubt, during the process of cleansing up the
data, it is inevitable to encounter several problems and we must find a way to solve all these
problems.
REFERENCES
[1] Kofi Adu-Manu, A. Kingsley Arthur. J. (2013). A Review of Data Cleansing Concepts
chievable Goals and Limitations, International Journal of Computer Applications (0975 –
8887) Volume 76– No.7
[2] Dwivedi, S. Rawat, B. (2015), A Review Paper on Data Preprocessing: A Critical Phase in
Web Usage Mining Process, International Conference on Green Computing and Internet of
Things (ICGCIoT)
[3] Choudhary, N. (2014), Study over Problems and Approaches of Data Cleansing/Cleaning ,
International Journal of Advanced Research in Computer Science and Software Engineering,
Volume 4, Issue 2
[4] Kofi Adu-Manu, A. Kingsley Arthur. J. (2013). Analysis of Data Cleansing Approaches
regarding Dirty Data – A Comparative Study International Journal of Computer
Applications (0975 – 8887) Volume 76– No.7
[5] Savitri F.N., Laksmiwati H.( 2011), Study of Localized Data Cleansing Process for ETL
Performance Improvement in Independent Datamart, International Conference on Electrical
Engineering and Informatics, 17-19 July 2011, Bandung, Indonesia
[6] Pachano, L.A., Khoshgoftaar T.M., Wald, W, (2013), Survey of Data Cleansing and
Monitoring for Large-Scale Battery Backup, Installations, 12th International Conference on
Machine Learning and Applications, IEEE Computing Society
[7] Peng, T., A framework for data cleaning in data warehouses, school of computing, napier
university, 10 Colinton Road, Edinburgh, EH10 5DT, UK
[8] Volkovs. M., Chiang, F., Szlichta J, Ren´ee J. Miller , Continuous Data Cleaning, IBM
Centre for Advanced Studies in Toronto
[9] Pachano, L.A., Khoshgoftaar T.M., Wald, W, (2013), Survey of Data Cleansing and
Monitoring for Large-Scale Battery Backup, Installations, 12th International Conference on
Machine Learning and Applications, IEEE Computing Society
[10] Peng, T., A framework for data cleaning in data warehouses, school of computing, napier
university, 10 Colinton Road, Edinburgh, EH10 5DT, UK
[11] Volkovs. M., Chiang, F., Szlichta J, Ren´ee J. Miller , Continuous Data Cleaning, IBM
Centre for Advanced Studies in Toronto
[12] Galhardas, H. Data Cleaning: Model, Language and Algoritmes. University of Versailles, Saint-Quentin-
En-Yvelines, Ph.D., 2001.

Synoptic Analysis of Contemporary Data Cleansing Issues and Causes

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Synoptic Analysis of Contemporary Data Cleansing Issues and Causes

Uploaded by

Copyright:

Available Formats

Riphah Institute of Computing and Applied Sciences [RICAS], Lahore Campus 1

1.1 DATA QUALITY........................................................................................................................ 5

1.2 DATA CLEANSING ................................................................................................................. 5

2.1 High Volume of Data:...................................................................................................................... 6

2.2 Misspellings: ..................................................................................................................................... 6

2.4 Misfielded Value: ............................................................................................................................. 7

2.5 Irregularities ..................................................................................................................................... 7

2.7 Missing values .................................................................................................................................. 7

2 CHALLENGES IN DATA CLEANSING............................................................................................ 8

4 STAGES WHERE PROBLEMS ARISE ............................................................................................ 10

5.2 Initial Data Conversion .............................................................................................................. 10

5.3 System Consolidation.................................................................................................................. 11

5.4 Manual Data Entry ..................................................................................................................... 11

6 SIGNIFICANCE OF THIS RESEARCH ........................................................................................... 13

7 FRAMEWORK TO ACHIEVE DATA QUALITY ........................................................................... 15

Synoptic Analysis of Contemporary Data Cleansing

Imran Ahmed, Shah Nawaz, MSCS,

1.1 DATA QUALITY

 Completeness: Level at which needed data characteristics are provided.

1.2 DATA CLEANSING

2 PROBLEMS FACED DURING DATA CLEANSING

2.1 High Volume of Data:

2.3 Lexical Errors:

2.4 Misfielded Value:

2.5 Domain Format Errors:

2.7 Missing values

duplication and reduces sorting capacity.

Data cleaning is an integral part of data management. It is necessary to make data

3 CHALLENGES IN DATA CLEANSING

3.1 Error correction and loss of information

2 DATA CLEANSING PROCEDURE

Fig. 1 Data Cleansing Procedure

4.1 Data Auditing

4.3 Consolidate Data

3 STAGES WHERE PROBLEMS ARISE

5.1 Data Acquisition Process

5.2 Initial Data Conversion

5.3 System Consolidation

5.4 Manual Data Entry

5.6 Real Time Interfaces:

5.11 Data Manipulation:

6 SIGNIFICANCE OF THIS RESEARCH

Let Business Drive Data Quality

Appoint Data Stewards

Formulate A Data Governance Board

Build A Data Quality Firewall

Fig. 2 Framework to Avoid Pitfall

7 FRAMEWORK TO ACHIEVE DATA QUALITY

Data quality management is a cyclical process that involves a step-by-step logical

Fig. 3 Framework to achieve data quality

7.1 Data Quality Assessment

7.4 Improving data quality in operational systems

You might also like