©IDOSR PUBLICATIONS
International Digital Organization for Scientific Research ISSN: 2550-7931
IDOSR JOURNAL OF APPLIED SCIENCES 7(1) 29-40, 2022.
Review of the Implications of Uploading Unverified Datasets to a Data
Banking Site (Case Study of Kaggle)
Val Hyginus U. Eze1,2, Martin C. Eze1, Chibuzo C. Ogbonna1, Valentine S. Enyi1, Samuel A. Ugwu1, Chidinma E. Eze3

1 Department of Electronic Engineering, University of Nigeria, Nsukka.
2 Department of Computer Science, Faculty of Science, University of Yaoundé 1, Cameroon.
3 Department of Educational Foundation and Planning, University of Calabar, Calabar, Nigeria.
ABSTRACT
This review paper comprehensively details the methodologies involved in data
analysis and the evaluation steps. It shows that steps and phases are the two main
methodological parameters to be considered during data assessment if data of high
quality are to be obtained. The review finds that poor data quality is generally
caused by incompleteness, inconsistency, integrity and time-related dimensions, and
that the four major factors that cause errors in a dataset are duplication,
commutative entries, incorrect values and blank entries, which can lead to
catastrophe. This paper also reviews the types of datasets, the techniques adopted
to ensure good data quality, and the types of data measurement and their
classifications. Furthermore, the Kaggle site was used as a case study to show the
trend of data growth and its consequences for the world and for data bankers. It is
then deduced that low data quality, caused by errors during primary data mining and
entry, leads to wrong results, which in turn bring about wrong conclusions. It is
advised that data bankers such as Kaggle adopt critical data quality measures before
uploading data to their sites, to avoid catastrophe and harm to humans. Finally, the
solutions outlined in this review will serve as a guide to data bankers and miners
in obtaining data that are of high quality, fit for use and devoid of defects.
Keywords: Accuracy, Data Bank, Data Quality, Dataset, Defect, Fit for Use, Kaggle
INTRODUCTION
The rapid growth of big data has attracted the attention of researchers worldwide towards assessing the quality of the data used by an organisation in order to make it fit for a particular purpose [1]. However, with the upswing of technologies such as cloud computing, the internet and social media, the amount of generated data is increasing exponentially [2]. The enormous amount of data available in many forms and types forces organizations to come up with innovative ideas to fine-tune data in order to maintain quality [3]. From research, it was found that data are classified based on their structures. Data of lower quality are mainly obtained from unstructured data drawn from multiple sources, which makes data quality management complex. Some causes of poor data quality are carelessness in data entry, unverified external web data sources, system errors and poor data migration processes. The increase in the amount of data being used by organizations, stored in data banks and mined for competitive use over the last decades has led to a research shift from data mining to data quality maintenance and integrity, in order to make data fit for use [4][5]. The challenge of quality and quantity grows daily, as researchers are always in need of data for better output. High-quality data is a prerequisite for worldwide research harmonization, global spend analysis and integrated service management in compliance with regulatory and legal requirements [6][7]. The level of poor-quality data being uploaded to some data mining sites has awakened the spirit of researchers in refining the quality of the data for better results. The use of poor-quality data is catastrophic and leads to inaccurate results, inefficiency, risks and wrong decision making [8].

Data Quality
A dataset is a collection of related data items that may be accessed individually, in combination, or managed as a whole entity. A database can be structured to contain various information about a particular category of interest to the researcher. A database can also be considered a dataset, as some entities within it are related to a particular type of information [9][10][11][12]. Data quality is the high-quality information that is fit for use. Data are said to be fit for use when they are free of defects and possess the features needed to guide the user in making the right decision. Data quality can also be expressed as the discrepancy between the way data are viewed when presented as a piece of information in a system and their appearance in real-life scenarios. Hence, data quality can be generally expressed as the set of data that is fit for use and free from defects. For data to be certified fit for use and free from defects, they must be characterized and examined under the data quality dimensions.
Factors that Determine Data Quality
The key factors of the data quality dimensions that determine the fitness of data for use are data accuracy, data completeness, data consistency and time-related dimensions [13].
Data Accuracy
Accuracy is one of the data quality dimensions that is widely used to ascertain the fitness of data for use. Accuracy is the closeness of a data value to its real-life value (the real-world object) which the data aims to represent. Data accuracy can be classified as either syntactic accuracy or semantic accuracy; both are illustrated in the sketch after this list.
• Syntactic accuracy is the closeness of a data value to the elements of the corresponding definition domain. In syntactic accuracy, a data value is not compared to the real-world object that it aims to represent; rather, the focus is on checking the correspondence of a data value with respect to any value in the domain that defines the data value.
• Semantic accuracy is the closeness of a data value to the real-world object which it aims to represent. In order to be able to measure semantic accuracy, the true value of the real-world object must be known for comparison [14].
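To make the distinction concrete, the following is a minimal Python sketch, not drawn from the paper: the marital-status domain, the recorded values and the ground-truth values are all hypothetical, and, as noted above, semantic accuracy is only measurable when the true values are known.

```python
# Minimal sketch: syntactic vs. semantic accuracy on a single field.
# The domain, records and ground-truth values are hypothetical.

VALID_STATUS = {"single", "married", "divorced", "widowed"}  # definition domain

def syntactic_accuracy(values, domain):
    """Fraction of values that belong to the definition domain
    (no comparison with the real-world object)."""
    return sum(1 for v in values if v in domain) / len(values)

def semantic_accuracy(values, true_values):
    """Fraction of values that match the known real-world value;
    the true value must be available for comparison."""
    return sum(1 for v, t in zip(values, true_values) if v == t) / len(values)

recorded = ["single", "marrid", "married", "divorced"]   # "marrid" fails syntactically
truth    = ["single", "married", "single", "divorced"]   # third row fails semantically

print(syntactic_accuracy(recorded, VALID_STATUS))  # 0.75
print(semantic_accuracy(recorded, truth))          # 0.5
```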
Data Completeness
Completeness is the extent to which data are of sufficient breadth, depth and scope for the task they are meant to solve. It is classified into schema completeness, population completeness and column completeness [15].
• Schema completeness is the degree to which concepts and their properties are not missing from a data schema.
• Population completeness evaluates missing values with respect to a reference population.
• Column completeness is defined as a measure of the missing values for a specific property or column in a dataset.
It is important to know why a value is missing and to know when a value is represented with NaN or NA. In measuring the completeness of a table, it is important to note the difference between null values and incompleteness. A value is said to be a null value when it possesses any of these characteristics:
• The value does not exist (does not contribute to incompleteness)
• The value exists but is not known (contributes to incompleteness)
• It is not known whether the value exists (may or may not contribute to incompleteness)

From this, it can be concluded that incompleteness (missing values) can be detected, but it cannot be resolved with certainty unless the data miner is present to ascertain which type of error exists in the dataset, or the original datasheet is available.
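The completeness measures above can be computed mechanically once missing values are marked. The following is a minimal pandas sketch with a hypothetical two-column table; note that a NaN in pandas cannot by itself distinguish the three kinds of null value listed earlier, which is precisely why missing values can be detected but not certainly resolved.

```python
# Minimal sketch of column and population completeness using pandas.
# The table and its columns are hypothetical.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["Ada", "Obi", None, "Ngozi"],
    "age":  [34, np.nan, 27, np.nan],  # NaN marks a missing value of unknown kind
})

# Column completeness: share of non-missing values per property/column.
column_completeness = 1 - df.isna().mean()

# Population completeness: share of rows with no missing value at all.
population_completeness = 1 - df.isna().any(axis=1).mean()

print(column_completeness)      # name: 0.75, age: 0.50
print(population_completeness)  # 0.25 (only one fully populated row)
```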
Data Consistency
Data consistency concerns semantic rule violations, expressed as integrity constraints on dataset properties that must be satisfied for effective performance. There are two fundamental categories of integrity constraints, known as intra-relation and inter-relation constraints; both are illustrated in the sketch after this list.
• Intra-relation constraints define a range of admissible values for an attribute, outside of which values must not exist in that context. For example, a negative age recorded for a person in a database violates such a constraint.
• Inter-relation constraints involve attributes from other relational databases. For example, an individual is identified with different ages in two databases.
Time-Related Dimensions of Data
Data quality can be characterized by the ability of a dataset to maintain its quality after some years of metamorphosis, and also by how up to date it is. Most research recognizes three closely related time dimensions: currency, volatility and timeliness. Based on their characteristics, the three time-related dimensions are expressed in terms of how often the database is updated, or the time between receiving a data unit and delivering that data unit to a customer. The relationship between the three time-related dimensions is mathematically expressed in equation (1):

Timeliness = max(0, 1 - Currency/Volatility)    (1)
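Assuming the standard formulation of equation (1) from the data quality literature, where currency is the age of the data on delivery and volatility is how long the data remain valid, a minimal sketch is:

```python
# Minimal sketch of equation (1), assuming the standard formulation
# Timeliness = max(0, 1 - Currency / Volatility). The inputs are hypothetical.

def timeliness(currency_days: float, volatility_days: float) -> float:
    return max(0.0, 1.0 - currency_days / volatility_days)

print(timeliness(currency_days=10, volatility_days=40))  # 0.75: still fairly timely
print(timeliness(currency_days=60, volatility_days=40))  # 0.0: the data have expired
```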

Data Dimension Measurements
Data Dimension Measurement (DDM) is one of the important criteria to be considered in validating the quality of data. Designing the right metrics for measuring data quality is one of the most challenging tasks of data quality assessment, as the metrics should identify all errors without reflecting or repeating the same errors multiple times [16]. The simplest metric used in obtaining the value of objective measures is expressed as the ratio of the number of error lines to the total number of lines in the dataset, as shown in equation (2):

Ratio = (number of error lines) / (total number of lines)    (2)
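A minimal sketch of equation (2), with a hypothetical rule deciding when a line counts as an error line:

```python
# Minimal sketch of equation (2): error lines divided by total lines.
# The data frame and the error rule are hypothetical.
import pandas as pd

def error_ratio(df: pd.DataFrame, is_error) -> float:
    """Number of error lines divided by the total lines of the dataset."""
    return df.apply(is_error, axis=1).sum() / len(df)

df = pd.DataFrame({"age": [34, -5, 27, 200]})
print(error_ratio(df, lambda row: not (0 <= row["age"] <= 120)))  # 0.5
```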
However, the calculation of such ratios is only possible when there are clear rules on when an outcome is desirable or undesirable, which forms the basic foundation for a good outcome [15]. It is imperative to note that most methods provide only objective measures for assessing data quality dimensions, without recognizing the importance of the distinction between subjective and objective measures and the comparison between them, which forms the basic input for the identification of data quality problems. Table 1 shows the differences between objective and subjective measures.
Table 1: Objective Versus Subjective Data Quality Measures
Feature | Subjective | Objective
Measurement tool | Survey | Software
Measuring target | Representational information | Datum
Measuring standard | User satisfaction | Rules, patterns
Process | User involved | Automated
Results | Multiple | Single
Data storage | Business contexts | Databases

Types of Datasets
The supreme objective of any data quality analyst is to analyse and evaluate the quality of data in use accurately, to ensure that it is fit for use and devoid of defects. The concept of data itself involves a digital representation of real-world objects that can later be processed and manipulated by software procedures through a network. Data analysis is the process of combining extracted information and data mining techniques to achieve higher quality. Information extraction is the process of populating a database from unstructured or loosely structured text. Data mining techniques are then applied to discover patterns that can later be assessed for better results.
In the field of information systems, data are grouped as follows:
• Structured
• Unstructured
• Semi-structured.
In the field of statistics, data are also grouped as:
• Categorical
• Numerical
• Ordinal.
Categorical or Structured Data
This is also called structured data in information science and is expressed as a group of items with simple attributes defined within a domain. A domain is the type of value that can be assigned to an attribute. Structured data categorically corresponds to programming language data types such as Integer (whole numbers), Boolean (true/false) and String (sequence of characters). In statistics it is called categorical data, which represents data characteristics such as a person's sex, marital status, age etc. Categorical data can also use numerical values to represent strings, such as 1 to represent true and 0 to represent false. The assigned values have no analytical value, as they cannot meaningfully be added or subtracted [17][18].
Numerical or Unstructured Data
Unstructured data are ungrouped and non-tabulated sets of data presented in a well-accepted language, such as news articles, web pages, blogs, e-mails, etc. Unstructured data can be not only text but also images, audio and video clips. A huge amount of unstructured data is easily obtained and readily available on the internet nowadays, and there is a need to analyse and process such data for future use. The accuracy of structured data is dependent on the final user: it will be higher when the final consumer is a machine than when it is a human. Information extraction consists of five substantial sub-tasks: segmentation, classification, association, normalization and deduplication. In statistics, this kind of data is called numerical data. These data have meaning as a measurement, such as height, weight, IQ or blood pressure. Statisticians also call numerical data quantitative data, and it is further divided into discrete and continuous numerical data.
a. Discrete data represent items that can be counted and take possible values that can be listed out. The list of possible values may be fixed (finite).
b. Continuous data represent measurements whose possible values cannot be counted and can only be described using intervals on the real number line. For example, the exact amount of gas purchased at the pump for cars with 10-litre tanks would be continuous data from 0 litres to 10 litres, which can be represented by the interval [0, 10] [19].
Ordinal or Semi-Structured Data
This is called semi-structured data in information science, and it is expressed as flexibly structured data with semi-similar characteristics in comparison with traditional structured data. It is said to be flexible due to its inability to adopt the formal structure of the data models associated with relational databases or data tables. Ordinal data statistically mixes numerical and categorical data. The data fall into categories, but the numbers placed on the categories have meaning. For example, grading students in an exam on a scale from lowest (0) to highest (70) scores gives ordinal data. Ordinal data are often treated as categorical, where the groups are ordered when graphs and charts are made.

For example, if you survey 100 people and ask them to rate an exam on a scale from 0-70, taking the average of the 100 responses will have meaning, which is not the case with categorical data [18].
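The three statistical groupings can be mirrored directly in common tooling. The following minimal pandas sketch, with hypothetical survey columns, represents categorical, numerical and ordinal data and shows which operations are meaningful on each:

```python
# Minimal sketch of categorical, numerical and ordinal data in pandas.
# The survey columns are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "marital_status": ["single", "married", "single"],  # categorical: unordered labels
    "height_cm": [172.5, 181.0, 165.2],                 # numerical (continuous)
    "exam_score_band": pd.Categorical(                  # ordinal: ordered categories
        ["low", "high", "medium"],
        categories=["low", "medium", "high"],
        ordered=True,
    ),
})

print(df["marital_status"].astype("category").cat.categories)  # labels, no order
print(df["height_cm"].mean())       # averaging numerical data is meaningful
print(df["exam_score_band"].min())  # 'low': order is meaningful, spacing is not
```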
Types of Data Measurement Scale
There are two major data measurement scales used in measuring the accuracy of calibrated data measuring instruments. The two most popular data measurement scales are the ratio and interval scales.
Interval scales are numerical scales used to indicate the distance or difference between two variables/values. They are important because they open the dataset up to a wide realm of statistical analysis; for example, measures of central tendency such as the mode, mean and median, as well as the standard deviation, can be computed. The drawback is that an interval scale does not have a true zero.
The ratio scale is the ultimate nirvana when it comes to data measurement scales, because it gives detailed information about order and exact unit values and also has an absolute zero, which gives it an edge for measuring a very wide range of descriptive and inferential statistics [20].
Classification of Data
Data classification involves the methods and patterns by which data are collected and organized for use. Data can be broadly classified based on where and how they are sourced, namely primary and secondary data sources.
1. Primary data: Directly sourced data that have not been altered, refined or published are known as primary data. Primary data can be sourced from experiments, one-on-one interviews, questionnaires etc. An example is data obtained from renewable energy research on the effect of dust on solar photovoltaic panels, where the experiment was performed practically and the data collected by the researcher.
2. Secondary data: Secondhand information, or refined or altered data collected and recorded for future use, is known as secondary data. It can also be a piece of firsthand information that was recorded by the researcher for future use by other researchers. Such secondary data sources are found in some software applications/sites such as Jumia, Alibaba, jigjig and Kaggle. The owner of such data might have refined it to suit his purpose and finally launched it for researchers to consult for information with respect to their research area.
The problem with secondary data, namely the refining and fine-tuning of datasets, has caused more harm than good in the research field. Refined data cause problems because the refiner will not categorically state what has been refined, so the user is unaware that some of the inbuilt features of the data have been altered. Such data might be used for research purposes even without rigorous analysis, since the secondary user lacks knowledge of where the data were sourced from and of their authenticity, and relies instead on the description given by the publisher. In some cases, some uploaded values/data can be autogenerated and are prone to errors and manipulation. This has made secondary data highly unreliable, yet it does not discredit the use of secondary data, which has the merits of easier access, making research faster, and being generally cheaper than a primary source of data.
Data Quality Assessment Configurations
There have been difficulties in ascertaining the best data quality configurations to be adopted during data processing. Some researchers have proposed a dynamic configurational data quality assessment whose input configuration is generic data containing critical activities. This generic assessment technique was formulated by harnessing and grouping activities from selected data quality assessment methodologies. The order of, and dependencies between, activities were defined based on the input and output of the specific activities. Generic data quality process and configuration methods were defined to have the following characteristics:

1. Determine the aim of the assessment and the requirements related to the data quality assessment. The assessment status of the assessed processes must be stated by the data quality assessors, to guide users on how, when and where a particular method should be used for fitness.
2. Select the activities from the generic process model that contribute to the assessment aim and requirements.
3. Configure and arrange the activities in a sensible order and include activity dependencies.
Figure 1 shows the generic data quality assessment process adopted by different data quality assessors to ensure that the data in use are fit for purpose.

Figure 1: A generic data quality assessment process

Comparative Analysis of Data Quality Assessment Methodologies
This section reviews a comparative analysis of the data quality assessment techniques that existed at the time of this compilation, with the aim of identifying critical activities in data quality. The thirteen existing data quality assessment and improvement methods are compared on:
• The methodological phases and steps
• The strategies and techniques
• The data quality dimensions and metrics
• The types of data in use
• The types of information systems in use [21].
The methodology comparison measured on phases and steps is of utmost interest in a comprehensive review of the existing methods, as phases and steps are so important in ascertaining data quality. Table 2 shows the most effective and organised steps recognized in data quality assessment processes. The table further analyses data based on sources and the related problems associated with each method used in data sourcing. There are set-up rules used by schemas to examine and interview clients in order to improve performance and achieve a complete understanding of data and its related architectural management. The following steps are incorporated in order to obtain effective and reliable data.

Important steps to be adopted for better quality data output are outlined below; a sketch combining them follows the list:
1. Data quality requirement analysis: A technique in which the opinions of data users and administrators are surveyed and sampled to identify data quality issues and set new quality targets.
2. Identification of critical areas: This involves identifying the relevant areas and the sequence of data flow in a database for easy location and access.
3. Process modelling: The process of validating and updating the produced data for fitness.
4. Measurement of quality: This involves the use of the different data quality dimensions, the problems associated with each dimension, and the definition of the dimensional metrics to be used.
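As a rough illustration only, the four steps can be strung together into a single pass over one table; every column name, rule and threshold in this sketch is hypothetical and is not drawn from the methodologies compared in Table 2.

```python
# Minimal sketch tying the four steps together for one hypothetical table.
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [34, np.nan, -5, 27], "name": ["Ada", "Obi", None, "Ngozi"]})

# 1. Data quality requirement analysis: targets gathered from users/administrators.
targets = {"completeness": 0.9, "consistency": 1.0}

# 2. Identification of critical areas: columns the downstream task depends on.
critical_columns = ["age"]

# 3. Process modelling: validate and update the produced data for fitness.
df["age"] = df["age"].where(df["age"].between(0, 120))  # void inadmissible ages

# 4. Measurement of quality: per-dimension metrics on the critical columns.
measured = {
    "completeness": float(1 - df[critical_columns].isna().mean().mean()),
    "consistency": float(df["age"].dropna().between(0, 120).mean()),
}
print({k: (v, "OK" if v >= targets[k] else "below target") for k, v in measured.items()})
```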

Table 2: Comparison of Methodologies and their Assessment Steps

Step/method acronym | Data analysis | DQ requirement analysis | Identification of critical areas | Process modelling | Measurement of quality dimensions and metrics | Extensible to other dimensions
TDQM + + + + Fixed
DWQ + + + + Open
TIQM + + + + + Fixed
AIMQ + + + Fixed
CIHI + + Fixed
DQA + + + Open
IQM + + Open
ISTAT + + Fixed
AMEQ + + + + Open
COLDQ + + + + + Fixed
DaQuinCIS + + + + Open
QAFD + + + + Fixed
CDQ + + + + + Open
From this comprehensive review of methodologies, it was observed that no method is best or superior to the others for data quality assessment, but there are steps and phases to combine or follow in order to obtain a very high-quality output.
Causes of Error in a Dataset
The causes of error in a dataset that lead to low data quality are detailed below:
Commutative entries: A common issue where values are wrongly placed in an attribute that was not meant for them. It is a spurious entry that might result from omission or from sourcing data by means that are not fit for purpose, and such errors are very difficult to detect, especially in cumbersome data.
Incorrect values: This involves entering values that are either out of range or incorrect, which can cause grave errors in calculation. For instance, using non-numeric values for numeric attributes, such as the letter 'O' instead of the digit 0 [22], will be difficult to detect with the human eye.
Blank entries: These are mostly found in structured CSV/Excel and relational databases. For example, age in some cases cannot be a primary key, and as such blank spaces can be permitted. This is not appropriate, especially when the researcher or user needs accurate ages for research work. Every missing or blank space in a dataset needs to be filled with a value that will not change or negate the quality and characteristics of that dataset.
Duplication: This can be seen in data entries where a row or column entry is duplicated once or more. It mainly occurs where data are automatically generated, and it leads to false outcomes.
Causes of Low Data Quality

Inconsistency: This involves semantic rule violation, expressed as integrity constraints on dataset properties that must be satisfied for effective performance.
Incompleteness: The extent to which data are insufficient in breadth, depth and scope for the task they are meant to solve.
Inaccuracy: The degree of variance in the closeness of a data value to its real-life value (the real-world object) which the data aims to represent.
Duplication: This occurs when the same data with the same characteristics are entered more than once in a row or column.
Blank entries: A column or row is totally empty where there is a certainty of there being a value.
Incorrect values: This involves using a wrong symbol that resembles the correct one in its place.
Commutative entries: A common issue where values are wrongly placed in an attribute that was not meant for them.
Furthermore, data bankers like Kaggle, Jumia, Konga etc. should be cognizant of these aforementioned causes and characteristics of poor data quality and ensure that data uploaded to their sites are devoid of such defects.
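As a minimal illustration of such a pre-upload screen, the sketch below flags the causes listed above on a hypothetical CSV file; the file name, columns and rules are illustrative and are not part of any site's actual pipeline.

```python
# Minimal sketch of a pre-upload screen for duplication, blank entries,
# incorrect values and misplaced (commutative) entries.
# The file name, column names and rules are hypothetical.
import pandas as pd

def screen_dataset(df: pd.DataFrame) -> dict:
    return {
        "duplicated_rows": int(df.duplicated().sum()),
        "blank_entries": int(df.isna().sum().sum()),
        # incorrect values: letter 'O' where the digit 0 was intended
        "suspect_zeros": int(df["age"].astype(str).str.contains("O").sum()),
        # commutative entries: numeric content found in a text attribute
        "misplaced_values": int(pd.to_numeric(df["name"], errors="coerce").notna().sum()),
    }

df = pd.read_csv("uploaded_dataset.csv")  # hypothetical dataset awaiting upload
print(screen_dataset(df))
```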
Kaggle Data Site
Kaggle is an online data science community where passionate data science researchers learn and exchange ideas [23]. The Kaggle webpage can be accessed via https://www.kaggle.com/. Most researchers, especially data scientists, find the Kaggle site very useful for data sourcing, even though it has its own pros and cons [24][25]. Since it is a community, it encourages sharing datasets publicly among users. Datasets can also be private, which enables a user to use them privately, and a dataset can be shared personally with another user alongside its security key. There are stipulated guidelines on how to create datasets, rate datasets, ask for collaboration and, in general, how to use these datasets [26]. Datasets in Kaggle appear in different data formats, depending on which format the users deem fit to use. These formats range from CSV, JSON and xlsx to archives and BigQuery, with CSV being the most available and most used format. The major concern is the inability to check and ascertain the degree of correctness, completeness, integrity and accuracy, together with the quality, of data to be uploaded to the Kaggle site, even though data owners have a descriptive space in which to write brief details of the uploaded data. This is so because many of the datasets uploaded to the site were refined and altered by the data miner without stating in the description space exactly what was altered and why it was refined.
Kaggle Data Qualities
Kaggle is a worldwide site from which huge and bulk data are obtained for scientific and machine learning research purposes. For the site to attract more customers and researchers to visit and to always have confidence in the uploaded data, certain quality assurances are needed by the site users, such as the quality of the data to be uploaded and the sources of the data. However, because of the popularity of the Kaggle site, several datasets are uploaded without proper debugging and cleaning, which degrades the quality of data on the site. This drawback has led to ongoing data quality research on Kaggle and other data banking sites. Data quality can generally be expressed as a dataset's fitness for use and its ability to meet the purpose set by the data user. This definition simply shows that the quality of data is highly dependent on the context of the data user, in synergy with the customer's needs and the ability to use and access the data at the right time and in the right location. However, the data quality assessment and improvement process is not limited to the primary data miners but extends to the data users and other data stakeholders involved during data entry, data processing and data analysis [27]. This is unlike the conventional method of data division and analysis, where synergy and coherence among the data managers were not inculcated, and the gap between them seriously affects the quality of uploaded data.

In Kaggle, data quality is divided into four subgroups: data quality for a website, data quality for decision support, data quality assessment, and other data quality applications such as medical data quality and software development processes for engineers. Furthermore, the study has shown that data quality assurance is the major problem of the Kaggle site. Many researchers have studied the data in Kaggle and discovered that at first, when datasets were not as huge as terabytes, the challenges were limited to completeness and accuracy. However, as data began to grow in size and quantity, the research challenges, which are peculiar to each field of study and research area, shifted from two to five in the field of data management and others: completeness, correctness, consistency, plausibility and currency. Accuracy, completeness and consistency mostly affect the capacity of data to support research conclusions and therefore form the most important area of research in the data mining sector [28].

Trends and Importance of Data Quality Assurance in Kaggle
There have been numerous attempts to summarize the research on data quality trends in Kaggle based on the best methodology to adopt in examining the quality of data, but all to no avail. After the industrial revolution, the amount of information dominated by characters doubled every ten years. After 1970, the amount of information doubled every three years. Today, the global amount of information can double every two years. In 2011, the amount of global data created and copied reached 1.8 ZB. It is difficult to collect, clean, integrate and finally obtain the necessary high-quality data within a reasonable time frame because of the high proportion of unstructured data in a big dataset, which takes a lot of time to be transformed into a structured type of dataset and further processed. This is a great challenge to the existing techniques of data quality processing and management [29].

The first seminal data quality research topic was introduced by [30] in the year 1995. Since then, comprehensive research has gone on to summarize, classify and develop frameworks for data quality research [31][32][33]. In [30], a framework of data quality derived from a comprehensive analysis of publications from 1994-1995 was proposed. The authors compared data and data quality to a manufactured physical product and its quality, relating the management of data quality to established concepts for managing the quality of physical products. Moreover, research has continued as the quantity of data in Kaggle keeps growing every minute of the day, which led to a summary of data quality research using articles published between 1995 and 2005 [31]. From reviewed Kaggle data and others, relationships were derived based on the judgment and intuition of the researchers to present a conceptual assessment of data quality and its management for excellent results. According to [34], data quality research can be classified categorically based on the assessment, management and contextual aspects of data quality. Data quality as a novel framework, combining the factors of fitness for use as defined in [35] with the incorporation of the management elements defined in [30], is comprehensively reviewed in [32]. Furthermore, many topics on methods to categorize data quality research and to develop a framework that allows researchers to characterize their research were also reviewed in [33], which showed the different taxonomies and methods by which data can be examined for quality assurance.

In summary, as stated earlier, data quality research in Kaggle has grown to a critical juncture that has attracted the interest of researchers and the world at large. From 1907 to 2010, data quality was considered somewhat anecdotal and esoteric, but as of 2021 it is considered valuable and relevant because of the importance of data to researchers, data scientists and the world at large. Due to the tremendous upload of data and big data to Kaggle, the research area has witnessed extraordinary growth in the last five years, and as a result the Kaggle site serves as one of the best data banks for researchers. Many data bankers like Kaggle, Jumia, jigjig etc. will benefit greatly from this review paper, as it will help them in configuring the data to be uploaded to their sites based on the reviewed methodologies, so that high output is achieved.

CONCLUSION
This review has shown the different methods and techniques that can be adopted by researchers to ensure that data of high quality are obtained. It also detailed the types of data, data quality assurance, and the causes and consequences of poor data quality. It showed the different domains in which data exist and extensively detailed the types and classes of datasets that exist and their importance to data users. The Kaggle data site was used as a case study in this review paper, which examined in detail the different errors encountered by Kaggle and highlighted their causes. This review will serve as a guide to data scientists, primary data miners and data bankers on the best ways to handle their data so that they are devoid of defects and errors and fit for use at all times. Finally, this paper will assist data bankers in checking the quality of data, data quality assurance and the authenticity of the data from the primary miners before uploading them to their sites, to avoid misleading researchers or data users.

REFERENCES
[1] Abbasi, A., Sarker, S. & Chiang, R. H. L. (2016). Big Data Research in Information Systems: Toward an Inclusive Research Agenda. Journal of the Association for Information Systems, 17(2), 1-33.
[2] Cai, L. & Zhu, Y. (2015). The Challenges of Data Quality and Data Quality Assessment in the Big Data Era. Data Science Journal, 14(0), 2.
[3] Albala, M. (2011). Making Sense of Big Data in the Petabyte Age. Cognizant 20-20 Insights Executive Summary.
[4] Cappiello, C., Francalanci, C. & Pernici, B. (2004). Data quality assessment from the user's perspective. Proceedings of the 2004 International Workshop on Information Quality in Information Systems, 68-73.
[5] Hevner, A. R. & Chatterjee, S. (2010). Introduction to Design Science Research. In Design Research in Information Systems: Theory and Practice (Vol. 28, pp. 1-8).
[6] Kagermann, H., Österle, H. & Jordan, J. M. (2011). IT-Driven Business Models: Global Case Studies in Transformations. John Wiley & Sons.
[7] Shankaranarayan, G., Ziad, M. & Wang, R. Y. (2003). Managing Data Quality in Dynamic Decision Environments: An Information Product Approach. Journal of Database Management, 14(4), 14-32.
[8] Friedman, T. & Smith, M. (2011). Measuring the Business Value of Data Quality. Gartner (G00218962).
[9] TechTarget (2020). "Data set." whatis.com. https://whatis.techtarget.com/definition/data-set
[10] Dodds, L. (2013). "What is a dataset?" blog.ldodds.com. https://blog.ldodds.com/2013/02/09/what-is-a-dataset/
[11] Byju's (2019). "Data sets." https://byjus.com/maths/data-sets/
[12] Juran, J. M. & Godfrey, A. B. (1998). Juran's Quality Handbook. McGraw-Hill.
[13] Scannapieco, M. & Catarci, T. (2002). Data Quality under the Computer Science Perspective. Computer Engineering, 2(2), 1-12.
[14] Batini, C. & Scannapieco, M. (2006). Data Quality: Concepts, Methodologies and Techniques.
[15] Pipino, L. L., Lee, Y. W. & Wang, R. Y. (2002). Data Quality Assessment. Communications of the ACM, 45(4).
[16] del Pilar Angeles, M. & García-Ugalde, F. J. (2009). A Data Quality Practical Approach. International Journal on Advances in Software, 2(3).
[17] Kupta, T. (2019). "Types of Data Sets in Data Science, Data Mining & Machine Learning." towardsdatascience.com. https://towardsdatascience.com/types-of-data-sets-in-data-science-data-mining-machine-learning-eb47c80af7a
[18] Nandi, K. (2018). "Explore Your Data: Cases, Variables, Types of Variables." makemeanalyst.com. http://makemeanalyst.com/basic-statistics-explore-your-data-cases-variables-types-of-variables/
[19] Donges, N. (2018). "Data Types in Statistics." towardsdatascience.com. https://towardsdatascience.com/data-types-in-statistics-347e152e8bee
[20] "Types of Data & Measurement Scales: Nominal, Ordinal, Interval and Ratio." mymarketresearchmethods.com, 2020.
[21] Batini, C., Cappiello, C., Francalanci, C. & Maurino, A. (2009). Methodologies for Data Quality Assessment and Improvement. ACM Computing Surveys, 41(3), 1-52.
[22] Bhandari, P. "What is a ratio scale of measurement?" Scribbr. https://www.scribbr.com/statistics/ratio-data/ [Accessed 28 October 2020].
[23] Wikipedia. "Kaggle." https://en.wikipedia.org/wiki/Kaggle [Accessed 9 October 2020].
[24] "How appropriate is it to publish research based on Kaggle competition data on arXiv?" Quora, 25 January 2016. https://qr.ae/pNkBBI [Accessed 30 October 2020].
[25] Nikulin, V. (2015). "On the Method for Data Streams Aggregation to Predict Shoppers Loyalty." IJCNN, pp. 1454-1461, 12 July 2015.
[26] Kaggle. "Kaggle Datasets." https://www.kaggle.com/docs/datasets [Accessed 2 November 2020].
[27] "A Review of Data Quality Research in Achieving." Journal of Theoretical and Applied Information Technology, 95(12), 1-12, 2017.
[28] Meredith, N. Z., Hammond, W. E., Beverly, B. G. & Kahn, M. G. (2014). "Assessing Data Quality for Healthcare Systems Data Used in Clinical Research." Health Care System Research Collaboratory, 1, 1-26.
[29] Cai, L. & Zhu, Y. (2020). "The Challenges of Data Quality and Data Quality Assessment in the Big Data Era." Data Science Journal, 12(2), 1-10.
[30] Wang, R. Y., Storey, V. C. & Firth, P. (1995). "A Framework for Analysis of Data Quality Research." IEEE Transactions on Knowledge and Data Engineering (7), 623-640.
[31] Lima, L., Maçada, G. & Vargas, L. M. (2006). "Research into Information Quality: A Study of the State-of-the-Art in IQ and its Consolidation." In Proceedings of the International Conference on Information Quality, Cambridge, MA.
[32] Neely, M. P. & Cook, J. (2008). "A Framework for Classification of the Data and Information Quality Literature and Preliminary Results (1996-2007)." In Americas Conference on Information Systems (AMCIS), Toronto, CA.
[33] Madnick, S., Wang, R. Y. & Lee, Y. W. (2009). "Overview and Framework for Data and Information Quality Research." ACM Journal of Information and Data Quality (1), 1-22.
[34] Ge, M. & Helfert, M. (2007). "A Review of Information Quality Research." In Proceedings of the International Conference on Information Quality, Cambridge, MA.
[35] Juran, J. M. & Godfrey, A. B. (2000). Juran's Quality Handbook, McGraw Hill International Editions: Industrial Engineering Series, 5th Edition.
