Synoptic Analysis of Contemporary Data Cleansing Issues and Causes
Table of Contents
ABSTRACT .................................................................................. 4
Key words ................................................................................. 4
1. INTRODUCTION ........................................................................... 4
6.1 Let Business Drive Data Quality ...................................................... 14
6.2 Appoint Data Stewards ................................................................ 14
6.3 Formulate A Data Governance Board .................................................... 14
6.4 Build A Data Quality Firewall ........................................................ 14
8 CONCLUSION ............................................................................. 17
REFERENCES ............................................................................... 18
Riphah Institute of Computing and Applied Sciences [RICAS], Lahore Campus 4
ABSTRACT
The quality of a data set is judged by many parameters, including accuracy, consistency,
reliability, completeness, usefulness, and timeliness. Low quality data refers to missing, invalid,
irrelevant, outdated or incorrect data. Poor data quality does not just imply that the data has
been incorrectly acquired; there are many other reasons why data that is absolutely valid at
one time, for one function, can become entirely invalid for another company or function.
Data cleaning, also known as data cleansing, is designed to optimize the accuracy and quality of
data. The data cleansing process is based on the modification or deletion of incorrect,
incomplete, incorrectly formatted or duplicated data. Data cleansing can use analysis or other
methods to get rid of syntax errors, typographical errors, or fragments of records. In this
research, some of the modern problems encountered when cleaning up the data and how these
problems can be solved are listed in detail.
Key words:
Data mining, data cleansing, data scrubbing, problems of data cleansing, challenges of data
cleansing
1. INTRODUCTION
Data cleaning is a valuable process that can help businesses save time and increase efficiency.
Data cleansing frameworks are used by various organizations to delete duplicate data, correct
incorrect data, fix and amend data in undesired formats, and complete incomplete data in
marketing lists and databases. Businesses can save not only time but also money by adopting
suitable data cleansing techniques. Data cleansing is especially important for organizations that
have vast amounts of data to process. These organizations may include banks or government
organizations. In fact, many sources suggest that any company that uses and holds data should
invest in cleansing methodologies. Such techniques and methodologies should also be used
regularly, as inaccurate data levels can increase rapidly. In this section we look in detail at
various aspects of data cleansing. The key data quality components include accuracy,
consistency, reliability, completeness, usefulness, and timeliness.
The major areas that include data cleansing as part of their defining processes are data
warehousing, knowledge discovery in databases, and data information quality management (e.g.,
Total Data Quality Management TDQM). Data cleansing is defined in several (but similar) ways.
In [12], data cleansing is defined as the process of eliminating the errors and inconsistencies
in data and solving the object identity problem.
Researchers are trying to tackle a number of problems in the data cleansing process. Dirty data is
of particular interest in the context of research. Organizations need to understand the various data
cleansing issues and how to solve them [3]. The need for data cleansing increases dramatically as
multiple data sources are integrated. This process of making data accurate and consistent is
fraught with many problems, some of which are mentioned below:
2.2 Misspellings:
Spelling mistakes occur mainly because of typing errors. Wrong spellings of common words and
grammatical errors can be detected and corrected automatically; however, because a database
contains a huge amount of data that is unique, it is difficult to detect spelling errors at the
input level. In addition, spelling mistakes in data such as names and addresses are always
difficult to identify and correct [5].
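Misspelling detection of the kind described above can be approximated with fuzzy string matching of free-text values against a reference list. The following is a minimal sketch, with the caveat that the reference list of city names, the cutoff value, and the function itself are illustrative assumptions, not part of the original text:

```python
# Sketch: flag and correct likely misspellings by fuzzy-matching input values
# against a reference list. The city names below are illustrative assumptions.
from difflib import get_close_matches

REFERENCE_CITIES = ["Lahore", "Karachi", "Islamabad", "Peshawar", "Quetta"]

def correct_spelling(value: str, reference: list[str], cutoff: float = 0.8) -> str:
    """Return the closest reference value, or the input unchanged if nothing
    is similar enough (avoids 'correcting' genuinely new values)."""
    matches = get_close_matches(value, reference, n=1, cutoff=cutoff)
    return matches[0] if matches else value

if __name__ == "__main__":
    print(correct_spelling("Lahor", REFERENCE_CITIES))   # "Lahore"
    print(correct_spelling("Krachi", REFERENCE_CITIES))  # "Karachi"
    print(correct_spelling("Multan", REFERENCE_CITIES))  # unchanged: not a misspelling
```

Note the cutoff: set too low, valid but unseen values get "corrected" away; set too high, real typos slip through, which mirrors the trade-off the text describes for unique data.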
2.5 Irregularities
Irregularities concern the non-uniform use of units or values. For example, when entering
employee salaries, the salary may be recorded in different currencies. This type of data
requires subjective interpretation and can often lead to erroneous results.
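One way to handle such unit irregularities is to normalize all values to a single reference unit before analysis. A small sketch follows; the currency codes and exchange rates are illustrative assumptions, not live data:

```python
# Sketch: normalizing salary values recorded in mixed currencies to one unit.
# The rates below are illustrative assumptions, not real exchange rates.
RATES_TO_USD = {"USD": 1.0, "EUR": 1.1, "PKR": 0.0036}

def normalize_salary(amount: float, currency: str) -> float:
    """Convert a salary to USD; fail loudly on unknown units rather than
    silently guessing, which would reintroduce the irregularity."""
    try:
        return round(amount * RATES_TO_USD[currency], 2)
    except KeyError:
        raise ValueError(f"Unknown currency unit: {currency!r}")

salaries = [(50000, "USD"), (45000, "EUR"), (9000000, "PKR")]
normalized = [normalize_salary(a, c) for a, c in salaries]
```

Raising on an unknown unit is deliberate: a value whose unit cannot be determined needs the subjective interpretation the text mentions, and should be routed to a human rather than converted blindly.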
In quasi-integrated sources such as IBM's Discovery Link, data cleaning must be performed
each time data is accessed, which significantly increases response time and reduces efficiency
[6].
3.4 Framework for error detection:
In many cases, it will not be possible to derive a complete data cleansing chart to guide the
process in advance. This makes data cleansing an iterative process involving extensive
exploration and interaction, which may require a framework in the form of a collection of
methods for error detection and elimination in addition to data auditing. This can be integrated
with other data processing steps such as integration and maintenance.
Figure: Iterative error-detection framework — Data Auditing → Use/Repeat Multiple Methods →
Data Cleansing → Consolidate Data, with feedback between the steps.
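The iterative audit-and-cleanse loop of this framework can be sketched as follows. The toy dataset, the audit rule (flag records with a missing name), and the single cleansing method are all illustrative assumptions:

```python
# Sketch of the iterative audit -> cleanse -> feedback loop from the framework
# above, on a toy dataset. The audit rule and cleansing method are assumptions.
def audit(records):
    """Data auditing: return indices of records failing a quality rule."""
    return [i for i, r in enumerate(records) if r.get("name") in (None, "")]

def drop_flagged(records, flagged):
    """One cleansing method: eliminate the records the audit flagged."""
    return [r for i, r in enumerate(records) if i not in set(flagged)]

def iterative_cleansing(records, methods, max_rounds=5):
    """Apply multiple methods repeatedly, feeding audit results back each round."""
    for _ in range(max_rounds):
        flagged = audit(records)          # data auditing step
        if not flagged:                   # feedback: stop once data is clean
            break
        for method in methods:            # use/repeat multiple methods
            records = method(records, flagged)
            flagged = audit(records)      # re-audit after each method
    return records                        # consolidated, cleansed data

rows = [{"name": "Ali"}, {"name": None}, {"name": ""}, {"name": "Sara"}]
clean = iterative_cleansing(rows, [drop_flagged])
```

In a real system each round's audit findings would also feed back into refining the methods themselves, which is what makes the process exploratory and interactive rather than a fixed pipeline.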
When data is migrated from one existing database to another, many quality issues can arise.
Source data may itself be incorrect because of its own limitations; or mapping the old database
to the new database may have inconsistencies, or the conversion routines may map them
incorrectly. We often see that "legacy" systems have metadata that are not synchronized with
the actual schema implemented. The accuracy of the data dictionary serves as a basis for
conversion algorithms, mapping and effort. If the dictionary and actual data are out of sync, this
can lead to major data quality issues.
Mergers and acquisitions of companies lead to consolidation of data. Since the focus is
primarily on streamlining business processes, the combination of data is usually less important.
This can be catastrophic, especially if the data experts from the previous company are not
involved in the consolidation process, and desynchronized metadata is a problem. Merging two
databases that do not have compatible fields can result in data misassembly and compromise
the accuracy of the data.
Internal business processes can also cause the inadvertent introduction of errors in the data.
The following processes are responsible for internal changes to business data:
5.8 Data Processing
A company's data must be processed regularly for summaries, calculations and reconciliation.
There may have been a tried and proven cycle for data processing of this type in the past.
The code of the processing programs, the processes themselves, as well as the actual data
evolve over time; therefore, a repeated processing cycle may not give similar results. The
processed data can be completely skewed, and if it forms the basis of other successive
processing, the error can be compounded in several ways.
5.9 Data Cleansing:
Each company must periodically correct its incorrect data. Manual cleansing has largely been
supplanted by time- and effort-saving automation. Although this is very useful, it may
incorrectly affect thousands of records: the software used for automation may have bugs, or the
data specifications that form the basis of the cleansing algorithms may be incorrect. This can
render absolutely valid data invalid, and virtually reverse the very benefit of the cleansing
exercise [4].
5.10 Data Purging:
Old data must be regularly removed from the system to save valuable storage space and
reduce the effort required to maintain gigantic and obsolete volumes of information. Purging
destroys the data; therefore, an erroneous or accidental deletion can silently degrade data
quality. As with cleansing, bugs and incorrect data specifications in the purge software can
trigger unwarranted destruction of valuable data. Sometimes valid data may incorrectly match
the purge criteria and be erased.
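Because purging is irreversible, a purge routine benefits from explicit safeguards. The sketch below shows two of them, a dry-run mode and an archive-before-delete step; the record layout, cutoff date, and function names are illustrative assumptions:

```python
# Sketch: guarding a purge routine against the accidental destruction the text
# warns about. The record layout and cutoff are illustrative assumptions.
from datetime import date

def archive(records):
    """Placeholder: write purge candidates to cold storage before deletion."""
    pass

def purge(records, cutoff, dry_run=True):
    """Split records into (kept, purge candidates). With dry_run=True nothing
    is destroyed, so the purge criteria can be reviewed first."""
    keep = [r for r in records if r["last_used"] >= cutoff]
    candidates = [r for r in records if r["last_used"] < cutoff]
    if dry_run:
        return records, candidates   # report only; all records survive
    archive(candidates)              # archive before destroying
    return keep, candidates

rows = [{"id": 1, "last_used": date(2010, 1, 1)},
        {"id": 2, "last_used": date(2023, 6, 1)}]
kept, doomed = purge(rows, cutoff=date(2015, 1, 1), dry_run=True)
```

Reviewing the dry-run candidate list before a destructive run is a cheap way to catch valid data that incorrectly matches the purge criteria.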
System changes are usually planned against the documented representation of the expected
data. In reality, the data is often far from the documented behavior, and the result is chaotic.
A poorly tested system upgrade can cause irreparable damage to data quality.
5.14 New uses of data
Businesses need to find more revenue-generating uses for existing data, which can open up new
problems. Data intended for one purpose may simply not be suitable for another, and its use
for new purposes may lead to misinterpretations and invalid assumptions in the new domain.
5.15 Loss of expertise
Data experts and their data form a close bond: the expert usually has an "eye" for false data,
knows the exceptions well, and knows how to extract the relevant data and discard the rest.
This comes from long years of association with the legacy systems. When such experts retire,
move on, or are let go after a merger, the new data processing staff may not be aware of the
data anomalies that the experts had been rectifying. As a result, incorrect data may pass
unchecked through a process.
5.16 Automation of internal processes:
As more and more applications with higher levels of automation share huge amounts of data,
users are more exposed to erroneous internal data that was previously ignored. Companies risk
losing their credibility in the event of such exposure. Automation cannot replace the need to
validate information; intentional and unintentional modification of data by users may also
result in data degradation, which may be beyond the control of the enterprise. In conclusion,
data quality may be lost through the processes that bring data into the system, through the
processes that manipulate the data, and through aging and decay, where the data themselves
may not change over time.
New business models in modern and rapidly evolving scenarios are introducing new
and innovative functions and processes every day. Each automation cycle or redesign of an
existing process generates its own set of unforeseen challenges that affect the quality of the data.
The key to ensuring data quality is a specific study of the flow of data within each process and
the implementation of a regular audit and monitoring mechanism to detect data degradation. A
mixture of automation and manual validation and cleaning by trained data processors is the
need of the hour. Addressing data quality challenges should be a primary goal of any business
that wants to ensure its proper functioning and future growth.
Companies, large or small, struggle to maintain the quality of the ever-increasing data volumes
necessary for smooth operation. Data quality management does not mean merely sifting through
the data periodically and eliminating bad data. It is inherently necessary for businesses to
incorporate data quality into process streamlining and integration. Obsolete or incorrect data
can lead to major errors in business decisions [2].
Many strategies have been adopted by companies for effective management of data quality. A
focused approach to data governance and data management can have significant benefits. A
proactive approach to controlling, monitoring and correcting data quality is key, rather than
merely responding to data failures or dealing with detected anomalies. Some of the key
strategies are listed below:
Data within the enterprise is a financial asset, so it makes sense to have checks and balances to
ensure that the data entering the systems is of acceptable quality. In addition, whenever this
data is retrieved or modified, there is a potential risk of losing its "precision". Wrong data can
flow downstream and pollute subsequent data stores, which has an impact on the business.
Building an intelligent virtual firewall can detect and block erroneous data at the point where it
enters the system. Corrupted data automatically detected by the firewall is either returned to
the original source to be rectified, or adjusted before moving into the corporate environment.
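A minimal sketch of such a firewall follows: records are validated at the point of entry, and failures are routed back with their reasons instead of entering the corporate environment. The field names and validation rules here are illustrative assumptions:

```python
# Sketch of a data quality "firewall": validate records at the entry point and
# reject failures back to the source. The rules shown are assumptions.
import re

def validate(record):
    """Return the list of rule violations for one incoming record."""
    errors = []
    if not record.get("customer_id"):
        errors.append("missing customer_id")
    email = record.get("email", "")
    if not re.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", email):
        errors.append("malformed email")
    return errors

def firewall(incoming):
    """Admit clean records; collect rejects (with reasons) for the source."""
    admitted, rejected = [], []
    for rec in incoming:
        errors = validate(rec)
        (rejected if errors else admitted).append((rec, errors))
    return [r for r, _ in admitted], rejected

batch = [{"customer_id": "C1", "email": "a@b.com"},
         {"customer_id": "", "email": "bad-address"}]
clean, rejects = firewall(batch)
```

Returning the violation reasons with each rejected record is what makes the "return to source for rectification" path in the text actionable.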
Figure: The five phases of a data quality program —
Phase 1: Data Quality Assessment
Phase 2: Data Quality Measurement
Phase 3: Incorporating Data Quality into the functions and processes
Phase 4: Data Quality Improvement in operational systems
Phase 5: Inspect cases where Data Quality standards are not met and take remedial actions
7.1 Data quality assessment:
Data quality assessment measures the current state of the data against business objectives. It
provides a baseline for investing in and planning improvements in data quality, and for
measuring the results of successive improvements.
The evaluation of the data must be guided by an analysis of the impact of the data on the
business. The criticality of the data must be an important parameter in defining the scope and
priority of the data to be evaluated. This top-down approach can be complemented by the
bottom-up strategy of data profiling assessment, which identifies anomalies in the data and
then matches those anomalies to the potential impact on business goals. This correlation
provides a basis for measuring data quality and its relationship to business impact.
This phase must be completed by a formal report that clearly lists the results. The report can
be disseminated to stakeholders and decision-makers and thus lead to actions to improve data
quality.
7.2 Measurement of data quality
The output of the data evaluation report is used to refine the scope and identify critical data
elements. Attributes and dimensions to measure the quality of these data, the units of
measurement, and the acceptable thresholds for these measures form the basis for the
implementation of improvement processes. Attributes such as completeness, consistency and
timeliness are defined, which act as input for deciding which tools and techniques need to be
deployed to achieve the desired quality levels. Data validity rules are specified according to
these measures. This can help embed data controls in the functions that acquire or modify
data in the data lifecycle [1]. In turn, data quality dashboards and scorecards can be defined
for each business unit, derived from these metrics and their thresholds. These scores can be
captured, stored and periodically updated to monitor improvement.
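The attributes above can be turned into concrete, threshold-checked scores. A minimal sketch, in which the field names, the toy data, and the threshold values are all illustrative assumptions:

```python
# Sketch: measuring completeness and timeliness against acceptable thresholds.
# Field names, sample rows and thresholds are illustrative assumptions.
from datetime import date

def completeness(records, field):
    """Fraction of records with a non-empty value for `field`."""
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)

def timeliness(records, field, max_age_days, today):
    """Fraction of records updated within the acceptable freshness window."""
    fresh = sum(1 for r in records if (today - r[field]).days <= max_age_days)
    return fresh / len(records)

THRESHOLDS = {"completeness": 0.95, "timeliness": 0.90}

rows = [{"email": "a@b.com", "updated": date(2024, 1, 10)},
        {"email": "", "updated": date(2020, 1, 1)}]
scores = {
    "completeness": completeness(rows, "email"),
    "timeliness": timeliness(rows, "updated", 365, today=date(2024, 6, 1)),
}
breaches = {k: v for k, v in scores.items() if v < THRESHOLDS[k]}
```

Scores like these, captured per business unit and per run, are exactly what the dashboards and scorecards mentioned above would display over time.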
7.3 Integration of data quality into functions and processes:
The emphasis on feature creation takes precedence over data quality during application
development or system upgrade. The metrics defined above can be used to integrate data
quality targets into the system development lifecycle, integrated as mandatory requirements for
each phase of development. Data quality analysts must identify the data requirements for each
application. A complete walkthrough of the data flow within each application provides insight
into the likely points of insertion for the inspection and control routines. These requirements
must be added to the system's functional requirements for seamless integration into the
development lifecycle, validating the data as it is introduced into the system.
Data shared between data providers and consumers must be subject to contractual agreements
that clearly define acceptable levels of quality. Data metrics can be incorporated into these
contracts in the form of performance SLAs.
The definition of mutually agreed data standards and data formats facilitates the flow of
data from one company to another. The metadata can be placed in a repository that is
actively managed by the data center, which would ensure that the data is represented in a way
that is agreeable and beneficial to both collaborating parties. The gap analysis and alignment
of the business needs of both parties is also performed by this data center.
Data quality inspections can be done manually or through automated routines to determine
compliance levels. Workflows can be defined to periodically monitor the data and take
corrective action based on the SLA targets and the actions specified if these SLAs are not met.
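A periodic SLA-checking workflow of this kind can be sketched as a table of targets plus a check that maps each breach to its agreed corrective action. The metric names, limits, and action labels are illustrative assumptions, not terms from any real contract:

```python
# Sketch: checking measured data quality against contractual SLA targets and
# returning the agreed corrective actions. All values are assumptions.
SLAS = [
    {"metric": "duplicate_rate", "max": 0.02, "action": "return_to_provider"},
    {"metric": "null_rate",      "max": 0.05, "action": "adjust_in_place"},
]

def check_slas(measured, slas):
    """Return (metric, corrective action) for every breached SLA."""
    return [(s["metric"], s["action"])
            for s in slas if measured.get(s["metric"], 0.0) > s["max"]]

measured = {"duplicate_rate": 0.04, "null_rate": 0.01}
actions = check_slas(measured, SLAS)
```

Run on a schedule, such a check is the automated half of the monitoring workflow; the actions it emits still need an owner, which is where the contractual agreement comes in.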
7.5 Inspect instances where data quality standards are not being met and take corrective
action:
When data is found to fall below expected levels, corrective actions must be driven by
effective data quality monitoring mechanisms, analogous to the defect tracking systems used
in software development. Reporting data defects and tracking corrective actions can help
generate performance reports. A root cause analysis performed on each reported data error
provides direct feedback for understanding the flaws in the business processes.
In addition to the above, proactive cleaning and data correction cycles must be
performed from time to time to identify and detect more data errors that may have been
introduced despite strict quality controls [11].
Data quality can be maintained at or near peak levels by using effective data management
tools that facilitate and provide a solid framework for the implementation of data quality
measurement, monitoring, reporting and subsequent improvements. The chosen quality
management solution must closely match the unique business objectives of the company. Data
quality objectives and management plans must be shared among data producers, consumers,
business application developers and managers. Data quality, after all, is a joint responsibility,
and setting up high-quality data entry processes is essential to ensure it.
8 CONCLUSION
The main reasons for poor data quality include incorrect spellings at data entry, invalid data,
missing information, and so on. It is important that the right data is used, cleaned and
analyzed to make the best decisions possible. Without a doubt, several problems will
inevitably be encountered during the process of cleaning up the data, and a way must be
found to solve each of them.
REFERENCES
[1] Adu-Manu, K., Arthur, J. K. (2013). A Review of Data Cleansing Concepts: Achievable
Goals and Limitations. International Journal of Computer Applications (0975-8887),
Volume 76, No. 7.
[2] Dwivedi, S., Rawat, B. (2015). A Review Paper on Data Preprocessing: A Critical Phase in
Web Usage Mining Process. International Conference on Green Computing and Internet of
Things (ICGCIoT).
[3] Choudhary, N. (2014). Study over Problems and Approaches of Data Cleansing/Cleaning.
International Journal of Advanced Research in Computer Science and Software Engineering,
Volume 4, Issue 2.
[4] Adu-Manu, K., Arthur, J. K. (2013). Analysis of Data Cleansing Approaches regarding
Dirty Data: A Comparative Study. International Journal of Computer Applications
(0975-8887), Volume 76, No. 7.
[5] Savitri, F. N., Laksmiwati, H. (2011). Study of Localized Data Cleansing Process for ETL
Performance Improvement in Independent Datamart. International Conference on Electrical
Engineering and Informatics, 17-19 July 2011, Bandung, Indonesia.
[6] Pachano, L. A., Khoshgoftaar, T. M., Wald, W. (2013). Survey of Data Cleansing and
Monitoring for Large-Scale Battery Backup Installations. 12th International Conference on
Machine Learning and Applications, IEEE Computer Society.
[7] Peng, T. A Framework for Data Cleaning in Data Warehouses. School of Computing,
Napier University, 10 Colinton Road, Edinburgh, EH10 5DT, UK.
[8] Volkovs, M., Chiang, F., Szlichta, J., Miller, R. J. Continuous Data Cleaning. IBM Centre
for Advanced Studies in Toronto.
[9] Pachano, L. A., Khoshgoftaar, T. M., Wald, W. (2013). Survey of Data Cleansing and
Monitoring for Large-Scale Battery Backup Installations. 12th International Conference on
Machine Learning and Applications, IEEE Computer Society.
[10] Peng, T. A Framework for Data Cleaning in Data Warehouses. School of Computing,
Napier University, 10 Colinton Road, Edinburgh, EH10 5DT, UK.
[11] Volkovs, M., Chiang, F., Szlichta, J., Miller, R. J. Continuous Data Cleaning. IBM Centre
for Advanced Studies in Toronto.
[12] Galhardas, H. (2001). Data Cleaning: Model, Language and Algorithms. Ph.D. thesis,
University of Versailles, Saint-Quentin-en-Yvelines.